Spatial join

A Spatial join is yet another classic GIS problem. Getting attributes from one layer and transferring them into another layer based on their spatial relationship is something you most likely need to do on a regular basis.

The previous materials focused on learning how to perform a Point in Polygon query. We could now apply those techniques and create our own function to perform a spatial join between two layers based on their spatial relationship. We could for example join the attributes of a polygon layer into a point layer where each point would get the attributes of a polygon that contains the point.

Luckily, spatial joins (gpd.sjoin() -function) is already implemented in Geopandas, thus we do not need to create it ourselves. There are three possible types of join that can be applied in spatial join that are determined with op -parameter:

  • "intersects"

  • "within"

  • "contains"

Sounds familiar? Yep, all of those spatial relationships were discussed in the previous materials, thus you should know how they work.

Let’s perform a spatial join between the species monitoring data GeoPackage file, category_3_species_porijogi.gpkg and a Polygon layer that is extracted Corine Landuse Cover for the Porijogi region.

Download and clean the data

For this lesson we will be using publicly available Corine Landuse Cover for the Porijogi region in Tartumaa.

  • Unzip the file into into your L3 folder

porijogi_corine_landuse.dbf  porijogi_corine_landuse.shp
porijogi_corine_landuse.prj  porijogi_corine_landuse.shx

You should now have the files listed above in your Data folder.

  • Let’s read the data into geopandas and see what we have.

In [1]: import geopandas as gpd

# Filepath
In [2]: fp = "source/_static/data/L3/porijogi_corine_landuse.shp"

# Read the data
In [3]: lulc = gpd.read_file(fp)
# See the first rows
In [4]: lulc.head()
Out[4]: 
  code_12          id remark    area_ha    shape_area  clc_int Landuse  \
0     112  EU-2024407   None  67.055321  670553.20630      112    URML   
1     112  EU-2024418   None  36.452500  364525.00295      112    URML   
2     112  EU-2024426   None  33.525145  335251.45070      112    URML   
3     112  EU-2024431   None  30.111925  301119.25420      112    URML   
4     112  EU-2024434   None  70.979465  709794.64765      112    URML   

                                            geometry  
0  POLYGON ((658854.791 6458244.203, 658826.936 6...  
1  POLYGON ((663553.865 6459840.806, 663570.622 6...  
2  POLYGON ((659006.349 6463680.667, 659031.241 6...  
3  POLYGON ((658401.027 6466951.556, 658518.679 6...  
4  POLYGON ((658348.310 6467489.990, 658040.310 6...  

Okey so we have multiple columns in the dataset but the most important one here is the column clc_int (corine landuse code) that tells the type of landuse cover under that polygon. Download the codes lookup table

In order to know what the codes mean, we will merge a lookup table to our spatial landuse cover GeoDataframe.

In [5]: import pandas as pd

In [6]: codes = pd.read_csv('source/_static/data/L3/corine_landuse_codes.csv', sep=';')

In [7]: codes
Out[7]: 
    CLC_CODE                         LABEL1  \
0        111            Artificial surfaces   
1        112            Artificial surfaces   
2        121            Artificial surfaces   
3        122            Artificial surfaces   
4        123            Artificial surfaces   
5        124            Artificial surfaces   
6        131            Artificial surfaces   
7        132            Artificial surfaces   
8        133            Artificial surfaces   
9        141            Artificial surfaces   
10       142            Artificial surfaces   
11       211             Agricultural areas   
12       222             Agricultural areas   
13       231             Agricultural areas   
14       242             Agricultural areas   
15       243             Agricultural areas   
16       311  Forest and semi natural areas   
17       312  Forest and semi natural areas   
18       313  Forest and semi natural areas   
19       321  Forest and semi natural areas   
20       322  Forest and semi natural areas   
21       324  Forest and semi natural areas   
22       331  Forest and semi natural areas   
23       333  Forest and semi natural areas   
24       411                       Wetlands   
25       412                       Wetlands   
26       421                       Wetlands   
27       511                   Water bodies   
28       512                   Water bodies   
29       521                   Water bodies   
30       523                   Water bodies   

                                             LABEL2  \
0                                      Urban fabric   
1                                      Urban fabric   
2        Industrial, commercial and transport units   
3        Industrial, commercial and transport units   
4        Industrial, commercial and transport units   
5        Industrial, commercial and transport units   
6                 Mine, dump and construction sites   
7                 Mine, dump and construction sites   
8                 Mine, dump and construction sites   
9      Artificial, non-agricultural vegetated areas   
10     Artificial, non-agricultural vegetated areas   
11                                      Arable land   
12                                  Permanent crops   
13                                         Pastures   
14                 Heterogeneous agricultural areas   
15                 Heterogeneous agricultural areas   
16                                          Forests   
17                                          Forests   
18                                          Forests   
19  Scrub and/or herbaceous vegetation associations   
20  Scrub and/or herbaceous vegetation associations   
21  Scrub and/or herbaceous vegetation associations   
22         Open spaces with little or no vegetation   
23         Open spaces with little or no vegetation   
24                                  Inland wetlands   
25                                  Inland wetlands   
26                                Maritime wetlands   
27                                    Inland waters   
28                                    Inland waters   
29                                    Marine waters   
30                                    Marine waters   

                                               LABEL3          RGB  
0                             Continuous urban fabric  230-000-077  
1                          Discontinuous urban fabric  255-000-000  
2                      Industrial or commercial units  204-077-242  
3          Road and rail networks and associated land  204-000-000  
4                                          Port areas  230-204-204  
5                                            Airports  230-204-230  
6                            Mineral extraction sites  166-000-204  
7                                          Dump sites  166-077-000  
8                                  Construction sites  255-077-255  
9                                   Green urban areas  255-166-255  
10                       Sport and leisure facilities  255-230-255  
11                          Non-irrigated arable land  255-255-168  
12                  Fruit trees and berry plantations  242-166-077  
13                                           Pastures  230-230-077  
14                       Complex cultivation patterns  255-230-077  
15  Land principally occupied by agriculture, with...  230-204-077  
16                                Broad-leaved forest  128-255-000  
17                                  Coniferous forest  000-166-000  
18                                       Mixed forest  077-255-000  
19                                 Natural grasslands  204-242-077  
20                                Moors and heathland  166-255-128  
21                        Transitional woodland-shrub  166-242-000  
22                              Beaches, dunes, sands  230-230-230  
23                           Sparsely vegetated areas  204-255-204  
24                                     Inland marshes  166-166-255  
25                                          Peat bogs  077-077-255  
26                                       Salt marshes  204-204-255  
27                                      Water courses  000-204-242  
28                                       Water bodies  128-242-230  
29                                    Coastal lagoons  000-255-166  
30                                      Sea and ocean  230-242-255  

This table contains a field CLC_CODE which we use to connect the correct mapping to our landuse cover GeoDataframe, which has a field clc_int. We will now merge the lookup table codes (a Pandas dataframe) into our lulc GeoDataframe, based on the identifiers in the mentioned fields:

In [8]: lulc = lulc.merge(codes, left_on='clc_int', right_on='CLC_CODE')

In [9]: lulc.sample(10)
Out[9]: 
    code_12          id remark     area_ha    shape_area  clc_int Landuse  \
107     243  EU-2034187   None   42.361981  4.236198e+05      243    AGRL   
39      211  EU-2026767   None  657.003251  6.570033e+06      211    AGRL   
147     312  EU-2041029   None   43.920683  4.392068e+05      312    FRSE   
58      231  EU-2029759   None   92.973231  9.297323e+05      231    PAST   
116     243  EU-2034238   None   62.887154  6.288715e+05      243    AGRL   
64      231  EU-2029878   None  256.324714  2.563247e+06      231    PAST   
93      242  EU-2032606   None   99.110014  9.911001e+05      242    AGRL   
179     313  EU-2046222   None   53.232284  5.323228e+05      313    FRST   
171     313  EU-2046156   None   60.916253  6.091625e+05      313    FRST   
99      243  EU-2034113   None  122.314246  1.223142e+06      243    AGRL   

                                              geometry  CLC_CODE  \
107  POLYGON ((669087.936 6449236.491, 669076.938 6...       243   
39   POLYGON ((661203.930 6460226.991, 661289.184 6...       211   
147  POLYGON ((666235.246 6446880.994, 666189.937 6...       312   
58   POLYGON ((658221.047 6449814.551, 658231.796 6...       231   
116  POLYGON ((660278.062 6452665.995, 660167.995 6...       243   
64   POLYGON ((655454.919 6458495.026, 655094.679 6...       231   
93   POLYGON ((664829.263 6465229.475, 664866.937 6...       242   
179  POLYGON ((665337.868 6452005.635, 665346.810 6...       313   
171  POLYGON ((663292.056 6449565.496, 663313.305 6...       313   
99   POLYGON ((658469.924 6447204.593, 658475.492 6...       243   

                            LABEL1                            LABEL2  \
107             Agricultural areas  Heterogeneous agricultural areas   
39              Agricultural areas                       Arable land   
147  Forest and semi natural areas                           Forests   
58              Agricultural areas                          Pastures   
116             Agricultural areas  Heterogeneous agricultural areas   
64              Agricultural areas                          Pastures   
93              Agricultural areas  Heterogeneous agricultural areas   
179  Forest and semi natural areas                           Forests   
171  Forest and semi natural areas                           Forests   
99              Agricultural areas  Heterogeneous agricultural areas   

                                                LABEL3          RGB  
107  Land principally occupied by agriculture, with...  230-204-077  
39                           Non-irrigated arable land  255-255-168  
147                                  Coniferous forest  000-166-000  
58                                            Pastures  230-230-077  
116  Land principally occupied by agriculture, with...  230-204-077  
64                                            Pastures  230-230-077  
93                        Complex cultivation patterns  255-230-077  
179                                       Mixed forest  077-255-000  
171                                       Mixed forest  077-255-000  
99   Land principally occupied by agriculture, with...  230-204-077  
# See the column names and confirm that we now have a column called LABEL2, that gives us some textual description for the landuse codes
In [10]: lulc.columns
Out[10]: 
Index(['code_12', 'id', 'remark', 'area_ha', 'shape_area', 'clc_int',
       'Landuse', 'geometry', 'CLC_CODE', 'LABEL1', 'LABEL2', 'LABEL3', 'RGB'],
      dtype='object')
  • Let’s also get rid of all unnecessary columns by selecting only columns that we need i.e. Landuse, LABEL2 and geometry

# Columns that will be sected
In [11]: selected_cols = ['Landuse', 'LABEL2','geometry']

# Select those columns
In [12]: lulc = lulc[selected_cols]

# Let's see 10 randomly sampled rows
In [13]: lulc.sample(10)
Out[13]: 
    Landuse                            LABEL2  \
200    FRST                           Forests   
83     AGRL  Heterogeneous agricultural areas   
51     AGRL                       Arable land   
194    FRST                           Forests   
15     AGRL                       Arable land   
199    FRST                           Forests   
169    FRST                           Forests   
26     AGRL                       Arable land   
103    AGRL  Heterogeneous agricultural areas   
44     AGRL                       Arable land   

                                              geometry  
200  MULTIPOLYGON (((666229.712 6456136.660, 666214...  
83   POLYGON ((665405.163 6457486.269, 665401.873 6...  
51   POLYGON ((662998.119 6465195.489, 663004.989 6...  
194  POLYGON ((665177.900 6456656.750, 665299.453 6...  
15   POLYGON ((667346.056 6448925.436, 667282.868 6...  
199  POLYGON ((655580.368 6459091.492, 655605.341 6...  
169  MULTIPOLYGON (((650215.358 6450779.373, 650257...  
26   POLYGON ((654851.749 6456870.487, 654760.741 6...  
103  POLYGON ((666635.395 6447379.819, 666686.560 6...  
44   POLYGON ((659091.304 6461255.987, 658975.936 6...  

Now we have cleaned the data and have only those columns that we need for our analysis.

Join the layers

Now we are ready to perform the spatial join between the two layers that we have. The aim here is to get information about how many species sightings (of which species) happened in which landuse types? . Thus, we want to join attributes from the landuse layer we just modified into the already used and prepared monitoring GeoPackage file, category_3_species_porijogi.gpkg.

  • Read the category_3_species_porijogi.gpkg layer into memory

 # protected species under class 3 monitoring sightings
In [14]: species_fp = "source/_static/data/L3/category_3_species_porijogi.gpkg"

 # Read data
In [15]: species = gpd.read_file(species_fp, layer='category_3_species_porijogi', driver='GPKG')
  • Let’s make sure that the coordinate reference system of the layers are identical

# Check the crs of landuse
In [16]: lulc.crs
Out[16]: 
<Projected CRS: EPSG:3301>
Name: Estonian Coordinate System of 1997
Axis Info [cartesian]:
- X[north]: Northing (metre)
- Y[east]: Easting (metre)
Area of Use:
- name: Estonia - onshore and offshore.
- bounds: (20.37, 57.52, 28.2, 60.0)
Coordinate Operation:
- name: Estonian National Grid
- method: Lambert Conic Conformal (2SP)
Datum: Estonia 1997
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich

# Check the crs of species layer in case we need to reproject the geometries to make them comparable
In [17]: species.crs
Out[17]: 
<Projected CRS: EPSG:3301>
Name: Estonian Coordinate System of 1997
Axis Info [cartesian]:
- X[north]: Northing (metre)
- Y[east]: Easting (metre)
Area of Use:
- name: Estonia - onshore and offshore.
- bounds: (20.37, 57.52, 28.2, 60.0)
Coordinate Operation:
- name: Estonian National Grid
- method: Lambert Conic Conformal (2SP)
Datum: Estonia 1997
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich

# Do they match? - We can test that
In [18]: lulc.crs == species.crs
Out[18]: True

They are identical. Thus, we can be sure that when doing spatial queries between layers the locations match and we get the right results e.g. from the spatial join that we are conducting here.

  • Let’s now join the attributes from lulc (2) GeoDataFrame into species GeoDataFrame (1) by using gpd.sjoin() -function

# Make a spatial join
In [19]: join = gpd.sjoin(species, lulc, how="inner", op="within")

# Let's check the result
In [20]: join.head()
Out[20]: 
     OBJECTID        LIIK               NIMI   EXT_SYST_I    KKR_KOOD  \
126    144115  taimed III  läikiv kurdsirbik  -1324418359  KLO9400024   
132    144202  taimed III    siberi võhumõõk   1344032932  KLO9312850   
157    151696  taimed III     balti sõrmkäpp   -376320786  KLO9317690   
201    154300  taimed III  harilik käoraamat   -403691661  KLO9303449   
257    154692  taimed III       suur käopõll   1784194494  KLO9303464   

    PRIV_TYYP       STAATUS  IMPORT  LAADIMISKP  \
126    Avalik  kontrollitud       0  2018-10-29   
132    Avalik  kontrollitud       0  2018-10-29   
157    Avalik  kontrollitud       0  2018-10-29   
201    Avalik  kontrollitud       0  2018-10-29   
257    Avalik  kontrollitud       0  2018-10-29   

                           geometry  index_right Landuse  \
126  POINT (652641.854 6458099.088)          224    RNGB   
132  POINT (660367.911 6469599.134)          143    FRSD   
157  POINT (656609.812 6459582.666)          201    FRST   
201  POINT (657356.897 6459780.112)          201    FRST   
257  POINT (657291.854 6459783.092)          201    FRST   

                                              LABEL2  
126  Scrub and/or herbaceous vegetation associations  
132                                          Forests  
157                                          Forests  
201                                          Forests  
257                                          Forests  

Awesome! Now we have performed a successful spatial join where we got two new columns into our join GeoDataFrame, i.e. index_right that tells the index of the matching polygon in the lulc layer and species.

  • Let’s save this layer into a new Shapefile

# Output path
outfp = "source/_static/data/L3/landuse_per_species.shp"

# Save to disk
join.to_file(outfp)

Do the results make sense? Let’s evaluate this a bit by grouping and querying the resulting join for largest landuse type and species types combinations:

In [21]: join['NIMI'].value_counts()
Out[21]: 
balti sõrmkäpp            20
valge-toonekurg           20
kahkjaspunane sõrmkäpp    17
suur käopõll              16
aas-karukell               6
soo-neiuvaip               5
vööthuul-sõrmkäpp          5
sulgjas õhik               2
harilik käoraamat          2
siberi võhumõõk            2
läikiv kurdsirbik          1
valgelaup-rabakiil         1
Veski                      1
kodukakk                   1
hink                       1
vingerjas                  1
kuradi-sõrmkäpp            1
ohakasoomukas              1
suur-kuldtiib              1
roomav öövilge             1
hall käpp                  1
suur-rabakiil              1
Name: NIMI, dtype: int64
In [22]: join['LABEL2'].value_counts()
Out[22]: 
Forests                                            52
Scrub and/or herbaceous vegetation associations    22
Heterogeneous agricultural areas                   22
Pastures                                            4
Urban fabric                                        2
Inland waters                                       2
Arable land                                         2
Inland wetlands                                     1
Name: LABEL2, dtype: int64
# initialise empty list
In [23]: data_list = []

In [24]: for species_id, species_group in join.groupby('NIMI'):
   ....:     lulc_count = species_group['LABEL2'].value_counts()
   ....:     top = lulc_count.head(1)
   ....:     data_list.append({'species_id':species_id, 'all_sights': len(species_group), 'top_lulc': top.index[0], 'sights_in_top': top[0]})
   ....: 

# Creates DataFrame from now filled list wit hdata items.
In [25]: top_sights = pd.DataFrame(data_list)

# Print the data
In [26]: top_sights.sort_values(by=['all_sights','sights_in_top'], ascending=False).head(10)
Out[26]: 
                species_id  all_sights  \
18         valge-toonekurg          20   
2           balti sõrmkäpp          20   
6   kahkjaspunane sõrmkäpp          17   
15            suur käopõll          16   
1             aas-karukell           6   
13            soo-neiuvaip           5   
21       vööthuul-sõrmkäpp           5   
4        harilik käoraamat           2   
12         siberi võhumõõk           2   
14            sulgjas õhik           2   

                                           top_lulc  sights_in_top  
18                 Heterogeneous agricultural areas             15  
2                                           Forests             12  
6                                           Forests             12  
15                                          Forests             14  
1                                           Forests              4  
13  Scrub and/or herbaceous vegetation associations              3  
21                                          Forests              3  
4                                           Forests              1  
12                                          Forests              1  
14  Scrub and/or herbaceous vegetation associations              1  

Launch in the web/MyBinder:

https://mybinder.org/badge_logo.svg