Spatial join

A Spatial join is yet another classic GIS problem. Getting attributes from one layer and transferring them into another layer based on their spatial relationship is something you most likely need to do on a regular basis.

The previous materials focused on learning how to perform a Point in Polygon query. We could now apply those techniques and create our own function to perform a spatial join between two layers based on their spatial relationship. We could for example join the attributes of a polygon layer into a point layer where each point would get the attributes of a polygon that contains the point.

Luckily, spatial joins (gpd.sjoin() -function) is already implemented in Geopandas, thus we do not need to create it ourselves. There are three possible types of join that can be applied in spatial join that are determined with op -parameter:

"intersects"
"within"
"contains"

Sounds familiar? Yep, all of those spatial relationships were discussed in the previous materials, thus you should know how they work.

Let’s perform a spatial join between the species monitoring data GeoPackage file, category_3_species_porijogi.gpkg and a Polygon layer that is extracted Corine Landuse Cover for the Porijogi region.

Download and clean the data

For this lesson we will be using publicly available Corine Landuse Cover for the Porijogi region in Tartumaa.

Unzip the file into into your L3 folder

porijogi_corine_landuse.dbf  porijogi_corine_landuse.shp
porijogi_corine_landuse.prj  porijogi_corine_landuse.shx

You should now have the files listed above in your Data folder.

Let’s read the data into geopandas and see what we have.

In [1]: import geopandas as gpd

# Filepath
In [2]: fp = "source/_static/data/L3/porijogi_corine_landuse.shp"

# Read the data
In [3]: lulc = gpd.read_file(fp)

# See the first rows
In [4]: lulc.head()
Out[4]: 
  code_12          id remark    area_ha    shape_area  clc_int Landuse                                           geometry
0     112  EU-2024407   None  67.055321  670553.20630      112    URML  POLYGON ((658854.791 6458244.203, 658826.936 6...
1     112  EU-2024418   None  36.452500  364525.00295      112    URML  POLYGON ((663553.865 6459840.806, 663570.622 6...
2     112  EU-2024426   None  33.525145  335251.45070      112    URML  POLYGON ((659006.349 6463680.667, 659031.241 6...
3     112  EU-2024431   None  30.111925  301119.25420      112    URML  POLYGON ((658401.027 6466951.556, 658518.679 6...
4     112  EU-2024434   None  70.979465  709794.64765      112    URML  POLYGON ((658348.310 6467489.990, 658040.310 6...

Okey so we have multiple columns in the dataset but the most important one here is the column clc_int (corine landuse code) that tells the type of landuse cover under that polygon. Download the codes lookup table

In order to know what the codes mean, we will merge a lookup table to our spatial landuse cover GeoDataframe.

In [5]: import pandas as pd

In [6]: codes = pd.read_csv('source/_static/data/L3/corine_landuse_codes.csv', sep=';')

In [7]: codes
Out[7]: 
    CLC_CODE                         LABEL1                                           LABEL2                                             LABEL3          RGB
      111            Artificial surfaces                                     Urban fabric                            Continuous urban fabric  230-000-077
      112            Artificial surfaces                                     Urban fabric                         Discontinuous urban fabric  255-000-000
      121            Artificial surfaces       Industrial, commercial and transport units                     Industrial or commercial units  204-077-242
      122            Artificial surfaces       Industrial, commercial and transport units         Road and rail networks and associated land  204-000-000
      123            Artificial surfaces       Industrial, commercial and transport units                                         Port areas  230-204-204
      124            Artificial surfaces       Industrial, commercial and transport units                                           Airports  230-204-230
      131            Artificial surfaces                Mine, dump and construction sites                           Mineral extraction sites  166-000-204
      132            Artificial surfaces                Mine, dump and construction sites                                         Dump sites  166-077-000
      133            Artificial surfaces                Mine, dump and construction sites                                 Construction sites  255-077-255
      141            Artificial surfaces     Artificial, non-agricultural vegetated areas                                  Green urban areas  255-166-255
     142            Artificial surfaces     Artificial, non-agricultural vegetated areas                       Sport and leisure facilities  255-230-255
     211             Agricultural areas                                      Arable land                          Non-irrigated arable land  255-255-168
     222             Agricultural areas                                  Permanent crops                  Fruit trees and berry plantations  242-166-077
     231             Agricultural areas                                         Pastures                                           Pastures  230-230-077
     242             Agricultural areas                 Heterogeneous agricultural areas                       Complex cultivation patterns  255-230-077
     243             Agricultural areas                 Heterogeneous agricultural areas  Land principally occupied by agriculture, with...  230-204-077
     311  Forest and semi natural areas                                          Forests                                Broad-leaved forest  128-255-000
     312  Forest and semi natural areas                                          Forests                                  Coniferous forest  000-166-000
     313  Forest and semi natural areas                                          Forests                                       Mixed forest  077-255-000
     321  Forest and semi natural areas  Scrub and/or herbaceous vegetation associations                                 Natural grasslands  204-242-077
     322  Forest and semi natural areas  Scrub and/or herbaceous vegetation associations                                Moors and heathland  166-255-128
     324  Forest and semi natural areas  Scrub and/or herbaceous vegetation associations                        Transitional woodland-shrub  166-242-000
     331  Forest and semi natural areas         Open spaces with little or no vegetation                              Beaches, dunes, sands  230-230-230
     333  Forest and semi natural areas         Open spaces with little or no vegetation                           Sparsely vegetated areas  204-255-204
     411                       Wetlands                                  Inland wetlands                                     Inland marshes  166-166-255
     412                       Wetlands                                  Inland wetlands                                          Peat bogs  077-077-255
     421                       Wetlands                                Maritime wetlands                                       Salt marshes  204-204-255
     511                   Water bodies                                    Inland waters                                      Water courses  000-204-242
     512                   Water bodies                                    Inland waters                                       Water bodies  128-242-230
     521                   Water bodies                                    Marine waters                                    Coastal lagoons  000-255-166
     523                   Water bodies                                    Marine waters                                      Sea and ocean  230-242-255

This table contains a field CLC_CODE which we use to connect the correct mapping to our landuse cover GeoDataframe, which has a field clc_int. We will now merge the lookup table codes (a Pandas dataframe) into our lulc GeoDataframe, based on the identifiers in the mentioned fields:

In [8]: lulc = lulc.merge(codes, left_on='clc_int', right_on='CLC_CODE')

In [9]: lulc.sample(10)
Out[9]: 
    code_12          id remark     area_ha  ...                         LABEL1                                           LABEL2                                             LABEL3          RGB
136     243  EU-2034427   None  148.476483  ...             Agricultural areas                 Heterogeneous agricultural areas  Land principally occupied by agriculture, with...  230-204-077
239     324  EU-2052422   None   90.313232  ...  Forest and semi natural areas  Scrub and/or herbaceous vegetation associations                        Transitional woodland-shrub  166-242-000
167     313  EU-2046121   None   27.677425  ...  Forest and semi natural areas                                          Forests                                       Mixed forest  077-255-000
70      231  EU-2030036   None   46.906026  ...             Agricultural areas                                         Pastures                                           Pastures  230-230-077
63      231  EU-2029875   None   81.661622  ...             Agricultural areas                                         Pastures                                           Pastures  230-230-077
55      211  EU-2026956   None  310.846714  ...             Agricultural areas                                      Arable land                          Non-irrigated arable land  255-255-168
80      242  EU-2032462   None   83.893750  ...             Agricultural areas                 Heterogeneous agricultural areas                       Complex cultivation patterns  255-230-077
109     243  EU-2034192   None   29.355642  ...             Agricultural areas                 Heterogeneous agricultural areas  Land principally occupied by agriculture, with...  230-204-077
144     312  EU-2041007   None   54.119761  ...  Forest and semi natural areas                                          Forests                                  Coniferous forest  000-166-000
81      242  EU-2032473   None   54.556966  ...             Agricultural areas                 Heterogeneous agricultural areas                       Complex cultivation patterns  255-230-077

[10 rows x 13 columns]

# See the column names and confirm that we now have a column called LABEL2, that gives us some textual description for the landuse codes
In [10]: lulc.columns
Out[10]: 
Index(['code_12', 'id', 'remark', 'area_ha', 'shape_area', 'clc_int',
       'Landuse', 'geometry', 'CLC_CODE', 'LABEL1', 'LABEL2', 'LABEL3', 'RGB'],
      dtype='object')

Let’s also get rid of all unnecessary columns by selecting only columns that we need i.e. Landuse, LABEL2 and geometry

# Columns that will be sected
In [11]: selected_cols = ['Landuse', 'LABEL2','geometry']

# Select those columns
In [12]: lulc = lulc[selected_cols]

# Let's see 10 randomly sampled rows
In [13]: lulc.sample(10)
Out[13]: 
    Landuse                            LABEL2                                           geometry
63     PAST                          Pastures  POLYGON ((663852.277 6457100.927, 663820.431 6...
15     AGRL                       Arable land  POLYGON ((667346.056 6448925.436, 667282.868 6...
142    FRSD                           Forests  POLYGON ((659960.470 6466484.630, 659990.373 6...
114    AGRL  Heterogeneous agricultural areas  MULTIPOLYGON (((650961.058 6453763.991, 650960...
174    FRST                           Forests  POLYGON ((665445.498 6448844.493, 665364.120 6...
21     AGRL                       Arable land  POLYGON ((659696.375 6454079.995, 659655.429 6...
199    FRST                           Forests  POLYGON ((655580.368 6459091.492, 655605.341 6...
130    AGRL  Heterogeneous agricultural areas  POLYGON ((656282.430 6459869.487, 656317.304 6...
20     AGRL                       Arable land  POLYGON ((655302.115 6453709.695, 655281.748 6...
55     AGRL                       Arable land  MULTIPOLYGON (((663877.511 6466895.588, 663738...

Now we have cleaned the data and have only those columns that we need for our analysis.

Join the layers

Now we are ready to perform the spatial join between the two layers that we have. The aim here is to get information about how many species sightings (of which species) happened in which landuse types? . Thus, we want to join attributes from the landuse layer we just modified into the already used and prepared monitoring GeoPackage file, category_3_species_porijogi.gpkg.

Read the category_3_species_porijogi.gpkg layer into memory

# protected species under class 3 monitoring sightings
In [14]: species_fp = "source/_static/data/L3/category_3_species_porijogi.gpkg"

# Read data
In [15]: species = gpd.read_file(species_fp, layer='category_3_species_porijogi', driver='GPKG')

Let’s make sure that the coordinate reference system of the layers are identical:

# Check the crs of landuse
In [16]: lulc.crs
Out[16]: 
<Derived Projected CRS: EPSG:3301>
Name: Estonian Coordinate System of 1997
Axis Info [cartesian]:
- X[north]: Northing (metre)
- Y[east]: Easting (metre)
Area of Use:
- name: Estonia - onshore and offshore.
- bounds: (20.37, 57.52, 28.2, 60.0)
Coordinate Operation:
- name: Estonian National Grid
- method: Lambert Conic Conformal (2SP)
Datum: Estonia 1997
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich

# Check the crs of species layer in case we need to reproject the geometries to make them comparable
In [17]: species.crs
Out[17]: 
<Derived Projected CRS: EPSG:3301>
Name: Estonian Coordinate System of 1997
Axis Info [cartesian]:
- X[north]: Northing (metre)
- Y[east]: Easting (metre)
Area of Use:
- name: Estonia - onshore and offshore.
- bounds: (20.37, 57.52, 28.2, 60.0)
Coordinate Operation:
- name: Estonian National Grid
- method: Lambert Conic Conformal (2SP)
Datum: Estonia 1997
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich

# Do they match? - We can test that
In [18]: lulc.crs == species.crs
Out[18]: True

They are identical. Thus, we can be sure that when doing spatial queries between layers the locations match and we get the right results e.g. from the spatial join that we are conducting here.

Let’s now join the attributes from lulc (2) GeoDataFrame into species GeoDataFrame (1) by using gpd.sjoin() -function

# Make a spatial join
In [19]: join = gpd.sjoin(species, lulc, how="inner", predicate="within")

# Let's check the result
In [20]: join.head()
Out[20]: 
     OBJECTID        LIIK               NIMI   EXT_SYST_I    KKR_KOOD  ...  LAADIMISKP                        geometry  index_right Landuse                                           LABEL2
126    144115  taimed III  läikiv kurdsirbik  -1324418359  KLO9400024  ...  2018-10-29  POINT (652641.854 6458099.088)          224    RNGB  Scrub and/or herbaceous vegetation associations
132    144202  taimed III    siberi võhumõõk   1344032932  KLO9312850  ...  2018-10-29  POINT (660367.911 6469599.134)          143    FRSD                                          Forests
157    151696  taimed III     balti sõrmkäpp   -376320786  KLO9317690  ...  2018-10-29  POINT (656609.812 6459582.666)          201    FRST                                          Forests
201    154300  taimed III  harilik käoraamat   -403691661  KLO9303449  ...  2018-10-29  POINT (657356.897 6459780.112)          201    FRST                                          Forests
257    154692  taimed III       suur käopõll   1784194494  KLO9303464  ...  2018-10-29  POINT (657291.854 6459783.092)          201    FRST                                          Forests

[5 rows x 13 columns]

Awesome! Now we have performed a successful spatial join where we got two new columns into our join GeoDataFrame, i.e. index_right that tells the index of the matching polygon in the lulc layer and species.

Let’s save this layer into a new Shapefile:

# Output path
outfp = "source/_static/data/L3/landuse_per_species.shp"

# Save to disk
join.to_file(outfp)

Do the results make sense? Let’s evaluate this a bit by grouping and querying the resulting join for largest landuse type and species types combinations:

In [21]: join['NIMI'].value_counts()
Out[21]: 
balti sõrmkäpp            20
valge-toonekurg           20
kahkjaspunane sõrmkäpp    17
suur käopõll              16
aas-karukell               6
soo-neiuvaip               5
vööthuul-sõrmkäpp          5
sulgjas õhik               2
harilik käoraamat          2
siberi võhumõõk            2
läikiv kurdsirbik          1
valgelaup-rabakiil         1
Veski                      1
kodukakk                   1
hink                       1
vingerjas                  1
kuradi-sõrmkäpp            1
ohakasoomukas              1
suur-kuldtiib              1
roomav öövilge             1
hall käpp                  1
suur-rabakiil              1
Name: NIMI, dtype: int64

In [22]: join['LABEL2'].value_counts()
Out[22]: 
Forests                                            52
Scrub and/or herbaceous vegetation associations    22
Heterogeneous agricultural areas                   22
Pastures                                            4
Urban fabric                                        2
Inland waters                                       2
Arable land                                         2
Inland wetlands                                     1
Name: LABEL2, dtype: int64

# initialise empty list
In [23]: data_list = []

In [24]: for species_id, species_group in join.groupby('NIMI'):
   ....:     lulc_count = species_group['LABEL2'].value_counts()
   ....:     top = lulc_count.head(1)
   ....:     data_list.append({'species_id':species_id, 'all_sights': len(species_group), 'top_lulc': top.index[0], 'sights_in_top': top[0]})
   ....: 

# Creates DataFrame from now filled list wit hdata items.
In [25]: top_sights = pd.DataFrame(data_list)

# Print the data
In [26]: top_sights.sort_values(by=['all_sights','sights_in_top'], ascending=False).head(10)
Out[26]: 
                species_id  all_sights                                         top_lulc  sights_in_top
18         valge-toonekurg          20                 Heterogeneous agricultural areas             15
2           balti sõrmkäpp          20                                          Forests             12
6   kahkjaspunane sõrmkäpp          17                                          Forests             12
15            suur käopõll          16                                          Forests             14
1             aas-karukell           6                                          Forests              4
13            soo-neiuvaip           5  Scrub and/or herbaceous vegetation associations              3
21       vööthuul-sõrmkäpp           5                                          Forests              3
4        harilik käoraamat           2                                          Forests              1
12         siberi võhumõõk           2                                          Forests              1
14            sulgjas õhik           2  Scrub and/or herbaceous vegetation associations              1

Launch in the web/MyBinder: