Spatial join

A Spatial join is yet another classic GIS problem. Getting attributes from one layer and transferring them into another layer based on their spatial relationship is something you most likely need to do on a regular basis.

The previous materials focused on learning how to perform a Point in Polygon query. We could now apply those techniques and create our own function to perform a spatial join between two layers based on their spatial relationship. We could for example join the attributes of a polygon layer into a point layer where each point would get the attributes of a polygon that contains the point.

Luckily, spatial joins (gpd.sjoin() -function) is already implemented in Geopandas, thus we do not need to create it ourselves. There are three possible types of join that can be applied in spatial join that are determined with op -parameter:

  • "intersects"

  • "within"

  • "contains"

Sounds familiar? Yep, all of those spatial relationships were discussed in the previous materials, thus you should know how they work.

Let’s perform a spatial join between the species monitoring data GeoPackage file, category_3_species_porijogi.gpkg and a Polygon layer that is extracted Corine Landuse Cover for the Porijogi region.

Download and clean the data

For this lesson we will be using publicly available Corine Landuse Cover for the Porijogi region in Tartumaa.

  • Unzip the file into into your L3 folder

porijogi_corine_landuse.dbf  porijogi_corine_landuse.shp
porijogi_corine_landuse.prj  porijogi_corine_landuse.shx

You should now have the files listed above in your Data folder.

  • Let’s read the data into geopandas and see what we have.

In [1]: import geopandas as gpd

# Filepath
In [2]: fp = "source/_static/data/L3/porijogi_corine_landuse.shp"

# Read the data
In [3]: lulc = gpd.read_file(fp)
# See the first rows
In [4]: lulc.head()
Out[4]: 
  code_12          id remark    area_ha    shape_area  clc_int Landuse  \
0     112  EU-2024407   None  67.055321  670553.20630      112    URML   
1     112  EU-2024418   None  36.452500  364525.00295      112    URML   
2     112  EU-2024426   None  33.525145  335251.45070      112    URML   
3     112  EU-2024431   None  30.111925  301119.25420      112    URML   
4     112  EU-2024434   None  70.979465  709794.64765      112    URML   

                                            geometry  
0  POLYGON ((658854.791 6458244.203, 658826.936 6...  
1  POLYGON ((663553.865 6459840.806, 663570.622 6...  
2  POLYGON ((659006.349 6463680.667, 659031.241 6...  
3  POLYGON ((658401.027 6466951.556, 658518.679 6...  
4  POLYGON ((658348.310 6467489.990, 658040.310 6...  

Okey so we have multiple columns in the dataset but the most important one here is the column clc_int (corine landuse code) that tells the type of landuse cover under that polygon. Download the codes lookup table

In order to know what the codes mean, we will merge a lookup table to our spatial landuse cover GeoDataframe.

In [5]: import pandas as pd

In [6]: codes = pd.read_csv('source/_static/data/L3/corine_landuse_codes.csv', sep=';')

In [7]: codes
Out[7]: 
    CLC_CODE                         LABEL1  \
0        111            Artificial surfaces   
1        112            Artificial surfaces   
2        121            Artificial surfaces   
3        122            Artificial surfaces   
4        123            Artificial surfaces   
5        124            Artificial surfaces   
6        131            Artificial surfaces   
7        132            Artificial surfaces   
8        133            Artificial surfaces   
9        141            Artificial surfaces   
10       142            Artificial surfaces   
11       211             Agricultural areas   
12       222             Agricultural areas   
13       231             Agricultural areas   
14       242             Agricultural areas   
15       243             Agricultural areas   
16       311  Forest and semi natural areas   
17       312  Forest and semi natural areas   
18       313  Forest and semi natural areas   
19       321  Forest and semi natural areas   
20       322  Forest and semi natural areas   
21       324  Forest and semi natural areas   
22       331  Forest and semi natural areas   
23       333  Forest and semi natural areas   
24       411                       Wetlands   
25       412                       Wetlands   
26       421                       Wetlands   
27       511                   Water bodies   
28       512                   Water bodies   
29       521                   Water bodies   
30       523                   Water bodies   

                                             LABEL2  \
0                                      Urban fabric   
1                                      Urban fabric   
2        Industrial, commercial and transport units   
3        Industrial, commercial and transport units   
4        Industrial, commercial and transport units   
5        Industrial, commercial and transport units   
6                 Mine, dump and construction sites   
7                 Mine, dump and construction sites   
8                 Mine, dump and construction sites   
9      Artificial, non-agricultural vegetated areas   
10     Artificial, non-agricultural vegetated areas   
11                                      Arable land   
12                                  Permanent crops   
13                                         Pastures   
14                 Heterogeneous agricultural areas   
15                 Heterogeneous agricultural areas   
16                                          Forests   
17                                          Forests   
18                                          Forests   
19  Scrub and/or herbaceous vegetation associations   
20  Scrub and/or herbaceous vegetation associations   
21  Scrub and/or herbaceous vegetation associations   
22         Open spaces with little or no vegetation   
23         Open spaces with little or no vegetation   
24                                  Inland wetlands   
25                                  Inland wetlands   
26                                Maritime wetlands   
27                                    Inland waters   
28                                    Inland waters   
29                                    Marine waters   
30                                    Marine waters   

                                               LABEL3          RGB  
0                             Continuous urban fabric  230-000-077  
1                          Discontinuous urban fabric  255-000-000  
2                      Industrial or commercial units  204-077-242  
3          Road and rail networks and associated land  204-000-000  
4                                          Port areas  230-204-204  
5                                            Airports  230-204-230  
6                            Mineral extraction sites  166-000-204  
7                                          Dump sites  166-077-000  
8                                  Construction sites  255-077-255  
9                                   Green urban areas  255-166-255  
10                       Sport and leisure facilities  255-230-255  
11                          Non-irrigated arable land  255-255-168  
12                  Fruit trees and berry plantations  242-166-077  
13                                           Pastures  230-230-077  
14                       Complex cultivation patterns  255-230-077  
15  Land principally occupied by agriculture, with...  230-204-077  
16                                Broad-leaved forest  128-255-000  
17                                  Coniferous forest  000-166-000  
18                                       Mixed forest  077-255-000  
19                                 Natural grasslands  204-242-077  
20                                Moors and heathland  166-255-128  
21                        Transitional woodland-shrub  166-242-000  
22                              Beaches, dunes, sands  230-230-230  
23                           Sparsely vegetated areas  204-255-204  
24                                     Inland marshes  166-166-255  
25                                          Peat bogs  077-077-255  
26                                       Salt marshes  204-204-255  
27                                      Water courses  000-204-242  
28                                       Water bodies  128-242-230  
29                                    Coastal lagoons  000-255-166  
30                                      Sea and ocean  230-242-255  

This table contains a field CLC_CODE which we use to connect the correct mapping to our landuse cover GeoDataframe, which has a field clc_int. We will now merge the lookup table codes (a Pandas dataframe) into our lulc GeoDataframe, based on the identifiers in the mentioned fields:

In [8]: lulc = lulc.merge(codes, left_on='clc_int', right_on='CLC_CODE')

In [9]: lulc.sample(10)
Out[9]: 
    code_12          id remark      area_ha    shape_area  clc_int Landuse  \
30      211  EU-2026721   None    51.589110  5.158911e+05      211    AGRL   
204     313  EU-2046436   None   178.733078  1.787331e+06      313    FRST   
46      211  EU-2026822   None    37.432131  3.743213e+05      211    AGRL   
16      211  EU-2026633   None    34.944840  3.494484e+05      211    AGRL   
57      231  EU-2029756   None    57.420560  5.742056e+05      231    PAST   
118     243  EU-2034261   None    26.894219  2.689422e+05      243    AGRL   
108     243  EU-2034191   None  1413.048177  1.413048e+07      243    AGRL   
69      231  EU-2030017   None   785.112680  7.851127e+06      231    PAST   
116     243  EU-2034238   None    62.887154  6.288715e+05      243    AGRL   
154     312  EU-2041206   None    90.963689  9.096369e+05      312    FRSE   

                                              geometry  CLC_CODE  \
30   POLYGON ((664580.232 6456124.995, 664577.116 6...       211   
204  POLYGON ((661448.216 6462739.348, 661435.432 6...       313   
46   POLYGON ((662124.874 6462823.495, 662220.245 6...       211   
16   POLYGON ((668192.805 6450263.495, 668179.121 6...       211   
57   POLYGON ((662912.623 6448033.995, 662855.868 6...       231   
118  POLYGON ((663042.054 6453857.496, 662837.556 6...       243   
108  MULTIPOLYGON (((654082.681 6452253.489, 654077...       243   
69   POLYGON ((662568.340 6468060.656, 662619.927 6...       231   
116  POLYGON ((660278.062 6452665.995, 660167.995 6...       243   
154  POLYGON ((657220.436 6458048.495, 657254.620 6...       312   

                            LABEL1                            LABEL2  \
30              Agricultural areas                       Arable land   
204  Forest and semi natural areas                           Forests   
46              Agricultural areas                       Arable land   
16              Agricultural areas                       Arable land   
57              Agricultural areas                          Pastures   
118             Agricultural areas  Heterogeneous agricultural areas   
108             Agricultural areas  Heterogeneous agricultural areas   
69              Agricultural areas                          Pastures   
116             Agricultural areas  Heterogeneous agricultural areas   
154  Forest and semi natural areas                           Forests   

                                                LABEL3          RGB  
30                           Non-irrigated arable land  255-255-168  
204                                       Mixed forest  077-255-000  
46                           Non-irrigated arable land  255-255-168  
16                           Non-irrigated arable land  255-255-168  
57                                            Pastures  230-230-077  
118  Land principally occupied by agriculture, with...  230-204-077  
108  Land principally occupied by agriculture, with...  230-204-077  
69                                            Pastures  230-230-077  
116  Land principally occupied by agriculture, with...  230-204-077  
154                                  Coniferous forest  000-166-000  
# See the column names and confirm that we now have a column called LABEL2, that gives us some textual description for the landuse codes
In [10]: lulc.columns
Out[10]: 
Index(['code_12', 'id', 'remark', 'area_ha', 'shape_area', 'clc_int',
       'Landuse', 'geometry', 'CLC_CODE', 'LABEL1', 'LABEL2', 'LABEL3', 'RGB'],
      dtype='object')
  • Let’s also get rid of all unnecessary columns by selecting only columns that we need i.e. Landuse, LABEL2 and geometry

# Columns that will be sected
In [11]: selected_cols = ['Landuse', 'LABEL2','geometry']

# Select those columns
In [12]: lulc = lulc[selected_cols]

# Let's see 10 randomly sampled rows
In [13]: lulc.sample(10)
Out[13]: 
    Landuse                                           LABEL2  \
80     AGRL                 Heterogeneous agricultural areas   
241    WETL                                  Inland wetlands   
130    AGRL                 Heterogeneous agricultural areas   
65     PAST                                         Pastures   
21     AGRL                                      Arable land   
49     AGRL                                      Arable land   
55     AGRL                                      Arable land   
4      URML                                     Urban fabric   
161    FRSE                                          Forests   
230    RNGB  Scrub and/or herbaceous vegetation associations   

                                              geometry  
80   POLYGON ((661999.553 6456694.993, 662158.623 6...  
241  POLYGON ((660625.437 6469578.988, 660637.496 6...  
130  POLYGON ((656282.430 6459869.487, 656317.304 6...  
65   POLYGON ((658319.810 6460064.491, 658182.936 6...  
21   POLYGON ((659696.375 6454079.995, 659655.429 6...  
49   POLYGON ((664101.431 6464658.061, 664168.493 6...  
55   MULTIPOLYGON (((663877.511 6466895.588, 663738...  
4    POLYGON ((658348.310 6467489.990, 658040.310 6...  
161  POLYGON ((654532.876 6461046.171, 654635.125 6...  
230  POLYGON ((662872.310 6461090.487, 662990.991 6...  

Now we have cleaned the data and have only those columns that we need for our analysis.

Join the layers

Now we are ready to perform the spatial join between the two layers that we have. The aim here is to get information about how many species sightings (of which species) happened in which landuse types? . Thus, we want to join attributes from the landuse layer we just modified into the already used and prepared monitoring GeoPackage file, category_3_species_porijogi.gpkg.

  • Read the category_3_species_porijogi.gpkg layer into memory

# protected species under class 3 monitoring sightings
In [14]: species_fp = "source/_static/data/L3/category_3_species_porijogi.gpkg"

# Read data
In [15]: species = gpd.read_file(species_fp, layer='category_3_species_porijogi', driver='GPKG')
  • Let’s make sure that the coordinate reference system of the layers are identical

# Check the crs of landuse
In [16]: lulc.crs
Out[16]: 
<Projected CRS: EPSG:3301>
Name: Estonian Coordinate System of 1997
Axis Info [cartesian]:
- X[north]: Northing (metre)
- Y[east]: Easting (metre)
Area of Use:
- name: Estonia
- bounds: (20.37, 57.52, 28.2, 60.0)
Coordinate Operation:
- name: Estonian National Grid
- method: Lambert Conic Conformal (2SP)
Datum: Estonia 1997
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich

# Check the crs of species layer in case we need to reproject the geometries to make them comparable
In [17]: species.crs
Out[17]: 
<Projected CRS: EPSG:3301>
Name: Estonian Coordinate System of 1997
Axis Info [cartesian]:
- X[north]: Northing (metre)
- Y[east]: Easting (metre)
Area of Use:
- name: Estonia
- bounds: (20.37, 57.52, 28.2, 60.0)
Coordinate Operation:
- name: Estonian National Grid
- method: Lambert Conic Conformal (2SP)
Datum: Estonia 1997
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich

# Do they match? - We can test that
In [18]: lulc.crs == species.crs
Out[18]: True

They are identical. Thus, we can be sure that when doing spatial queries between layers the locations match and we get the right results e.g. from the spatial join that we are conducting here.

  • Let’s now join the attributes from lulc (2) GeoDataFrame into species GeoDataFrame (1) by using gpd.sjoin() -function

# Make a spatial join
In [19]: join = gpd.sjoin(species, lulc, how="inner", op="within")

# Let's check the result
In [20]: join.head()
Out[20]: 
     OBJECTID        LIIK               NIMI   EXT_SYST_I    KKR_KOOD  \
126    144115  taimed III  läikiv kurdsirbik  -1324418359  KLO9400024   
132    144202  taimed III    siberi võhumõõk   1344032932  KLO9312850   
157    151696  taimed III     balti sõrmkäpp   -376320786  KLO9317690   
201    154300  taimed III  harilik käoraamat   -403691661  KLO9303449   
257    154692  taimed III       suur käopõll   1784194494  KLO9303464   

    PRIV_TYYP       STAATUS  IMPORT  LAADIMISKP  \
126    Avalik  kontrollitud       0  2018-10-29   
132    Avalik  kontrollitud       0  2018-10-29   
157    Avalik  kontrollitud       0  2018-10-29   
201    Avalik  kontrollitud       0  2018-10-29   
257    Avalik  kontrollitud       0  2018-10-29   

                           geometry  index_right Landuse  \
126  POINT (652641.854 6458099.088)          224    RNGB   
132  POINT (660367.911 6469599.134)          143    FRSD   
157  POINT (656609.812 6459582.666)          201    FRST   
201  POINT (657356.897 6459780.112)          201    FRST   
257  POINT (657291.854 6459783.092)          201    FRST   

                                              LABEL2  
126  Scrub and/or herbaceous vegetation associations  
132                                          Forests  
157                                          Forests  
201                                          Forests  
257                                          Forests  

Awesome! Now we have performed a successful spatial join where we got two new columns into our join GeoDataFrame, i.e. index_right that tells the index of the matching polygon in the lulc layer and species.

  • Let’s save this layer into a new Shapefile

# Output path
outfp = "source/_static/data/L3/landuse_per_species.shp"

# Save to disk
join.to_file(outfp)

Do the results make sense? Let’s evaluate this a bit by grouping and querying the resulting join for largest landuse type and species types combinations:

In [21]: join['NIMI'].value_counts()
Out[21]: 
balti sõrmkäpp            20
valge-toonekurg           20
kahkjaspunane sõrmkäpp    17
suur käopõll              16
aas-karukell               6
vööthuul-sõrmkäpp          5
soo-neiuvaip               5
sulgjas õhik               2
siberi võhumõõk            2
harilik käoraamat          2
läikiv kurdsirbik          1
suur-kuldtiib              1
suur-rabakiil              1
roomav öövilge             1
kuradi-sõrmkäpp            1
hink                       1
Veski                      1
valgelaup-rabakiil         1
ohakasoomukas              1
kodukakk                   1
hall käpp                  1
vingerjas                  1
Name: NIMI, dtype: int64
In [22]: join['LABEL2'].value_counts()
Out[22]: 
Forests                                            52
Heterogeneous agricultural areas                   22
Scrub and/or herbaceous vegetation associations    22
Pastures                                            4
Urban fabric                                        2
Inland waters                                       2
Arable land                                         2
Inland wetlands                                     1
Name: LABEL2, dtype: int64
# initialise empty list
In [23]: data_list = []

In [24]: for species_id, species_group in join.groupby('NIMI'):
   ....:     lulc_count = species_group['LABEL2'].value_counts()
   ....:     top = lulc_count.head(1)
   ....:     data_list.append({'species_id':species_id, 'all_sights': len(species_group), 'top_lulc': top.index[0], 'sights_in_top': top[0]})
   ....: 

# Creates DataFrame from now filled list wit hdata items.
In [25]: top_sights = pd.DataFrame(data_list)

# Print the data
In [26]: top_sights.sort_values(by=['all_sights','sights_in_top'], ascending=False).head(10)
Out[26]: 
                species_id  all_sights  \
18         valge-toonekurg          20   
2           balti sõrmkäpp          20   
6   kahkjaspunane sõrmkäpp          17   
15            suur käopõll          16   
1             aas-karukell           6   
13            soo-neiuvaip           5   
21       vööthuul-sõrmkäpp           5   
4        harilik käoraamat           2   
12         siberi võhumõõk           2   
14            sulgjas õhik           2   

                                           top_lulc  sights_in_top  
18                 Heterogeneous agricultural areas             15  
2                                           Forests             12  
6                                           Forests             12  
15                                          Forests             14  
1                                           Forests              4  
13  Scrub and/or herbaceous vegetation associations              3  
21                                          Forests              3  
4   Scrub and/or herbaceous vegetation associations              1  
12                 Heterogeneous agricultural areas              1  
14  Scrub and/or herbaceous vegetation associations              1  

Launch in the web/MyBinder:

https://mybinder.org/badge_logo.svg