Spatial join¶

A Spatial join is yet another classic GIS problem. Getting attributes from one layer and transferring them into another layer based on their spatial relationship is something you most likely need to do on a regular basis.

The previous materials focused on learning how to perform a Point in Polygon query. We could now apply those techniques and create our own function to perform a spatial join between two layers based on their spatial relationship. We could for example join the attributes of a polygon layer into a point layer where each point would get the attributes of a polygon that contains the point.

Luckily, spatial joins (gpd.sjoin() -function) is already implemented in Geopandas, thus we do not need to create it ourselves. There are three possible types of join that can be applied in spatial join that are determined with op -parameter:

"intersects"
"within"
"contains"

Sounds familiar? Yep, all of those spatial relationships were discussed in the previous materials, thus you should know how they work.

Let’s perform a spatial join between the species monitoring data GeoPackage file, category_3_species_porijogi.gpkg and a Polygon layer that is extracted Corine Landuse Cover for the Porijogi region.

Download and clean the data¶

For this lesson we will be using publicly available Corine Landuse Cover for the Porijogi region in Tartumaa.

Unzip the file into into your L3 folder

porijogi_corine_landuse.dbf  porijogi_corine_landuse.shp
porijogi_corine_landuse.prj  porijogi_corine_landuse.shx

You should now have the files listed above in your Data folder.

Let’s read the data into geopandas and see what we have.

In [1]: import geopandas as gpd

# Filepath
In [2]: fp = "source/_static/data/L3/porijogi_corine_landuse.shp"

# Read the data
In [3]: lulc = gpd.read_file(fp)

# See the first rows
In [4]: lulc.head()
Out[4]: 
  code_12          id remark    area_ha    shape_area  clc_int Landuse  \
0     112  EU-2024407   None  67.055321  670553.20630      112    URML   
1     112  EU-2024418   None  36.452500  364525.00295      112    URML   
2     112  EU-2024426   None  33.525145  335251.45070      112    URML   
3     112  EU-2024431   None  30.111925  301119.25420      112    URML   
4     112  EU-2024434   None  70.979465  709794.64765      112    URML   

                                            geometry  
0  POLYGON ((658854.791 6458244.203, 658826.936 6...  
1  POLYGON ((663553.865 6459840.806, 663570.622 6...  
2  POLYGON ((659006.349 6463680.667, 659031.241 6...  
3  POLYGON ((658401.027 6466951.556, 658518.679 6...  
4  POLYGON ((658348.310 6467489.990, 658040.310 6...  

Okey so we have multiple columns in the dataset but the most important one here is the column clc_int (corine landuse code) that tells the type of landuse cover under that polygon. Download the codes lookup table

In order to know what the codes mean, we will merge a lookup table to our spatial landuse cover GeoDataframe.

In [5]: import pandas as pd

In [6]: codes = pd.read_csv('source/_static/data/L3/corine_landuse_codes.csv', sep=';')

In [7]: codes
Out[7]: 
    CLC_CODE                         LABEL1  \
      111            Artificial surfaces   
      112            Artificial surfaces   
      121            Artificial surfaces   
      122            Artificial surfaces   
      123            Artificial surfaces   
      124            Artificial surfaces   
      131            Artificial surfaces   
      132            Artificial surfaces   
      133            Artificial surfaces   
      141            Artificial surfaces   
     142            Artificial surfaces   
     211             Agricultural areas   
     222             Agricultural areas   
     231             Agricultural areas   
     242             Agricultural areas   
     243             Agricultural areas   
     311  Forest and semi natural areas   
     312  Forest and semi natural areas   
     313  Forest and semi natural areas   
     321  Forest and semi natural areas   
     322  Forest and semi natural areas   
     324  Forest and semi natural areas   
     331  Forest and semi natural areas   
     333  Forest and semi natural areas   
     411                       Wetlands   
     412                       Wetlands   
     421                       Wetlands   
     511                   Water bodies   
     512                   Water bodies   
     521                   Water bodies   
     523                   Water bodies   

                                             LABEL2  \
                                    Urban fabric   
                                    Urban fabric   
      Industrial, commercial and transport units   
      Industrial, commercial and transport units   
      Industrial, commercial and transport units   
      Industrial, commercial and transport units   
               Mine, dump and construction sites   
               Mine, dump and construction sites   
               Mine, dump and construction sites   
    Artificial, non-agricultural vegetated areas   
   Artificial, non-agricultural vegetated areas   
                                    Arable land   
                                Permanent crops   
                                       Pastures   
               Heterogeneous agricultural areas   
               Heterogeneous agricultural areas   
                                        Forests   
                                        Forests   
                                        Forests   
Scrub and/or herbaceous vegetation associations   
Scrub and/or herbaceous vegetation associations   
Scrub and/or herbaceous vegetation associations   
       Open spaces with little or no vegetation   
       Open spaces with little or no vegetation   
                                Inland wetlands   
                                Inland wetlands   
                              Maritime wetlands   
                                  Inland waters   
                                  Inland waters   
                                  Marine waters   
                                  Marine waters   

                                               LABEL3          RGB  
                           Continuous urban fabric  230-000-077  
                        Discontinuous urban fabric  255-000-000  
                    Industrial or commercial units  204-077-242  
        Road and rail networks and associated land  204-000-000  
                                        Port areas  230-204-204  
                                          Airports  230-204-230  
                          Mineral extraction sites  166-000-204  
                                        Dump sites  166-077-000  
                                Construction sites  255-077-255  
                                 Green urban areas  255-166-255  
                     Sport and leisure facilities  255-230-255  
                        Non-irrigated arable land  255-255-168  
                Fruit trees and berry plantations  242-166-077  
                                         Pastures  230-230-077  
                     Complex cultivation patterns  255-230-077  
Land principally occupied by agriculture, with...  230-204-077  
                              Broad-leaved forest  128-255-000  
                                Coniferous forest  000-166-000  
                                     Mixed forest  077-255-000  
                               Natural grasslands  204-242-077  
                              Moors and heathland  166-255-128  
                      Transitional woodland-shrub  166-242-000  
                            Beaches, dunes, sands  230-230-230  
                         Sparsely vegetated areas  204-255-204  
                                   Inland marshes  166-166-255  
                                        Peat bogs  077-077-255  
                                     Salt marshes  204-204-255  
                                    Water courses  000-204-242  
                                     Water bodies  128-242-230  
                                  Coastal lagoons  000-255-166  
                                    Sea and ocean  230-242-255  

This table contains a field CLC_CODE which we use to connect the correct mapping to our landuse cover GeoDataframe, which has a field clc_int. We will now merge the lookup table codes (a Pandas dataframe) into our lulc GeoDataframe, based on the identifiers in the mentioned fields:

In [8]: lulc = lulc.merge(codes, left_on='clc_int', right_on='CLC_CODE')

In [9]: lulc.sample(10)
Out[9]: 
    code_12          id remark      area_ha    shape_area  clc_int Landuse  \
    211  EU-2026721   None    51.589110  5.158911e+05      211    AGRL   
   313  EU-2046436   None   178.733078  1.787331e+06      313    FRST   
    211  EU-2026822   None    37.432131  3.743213e+05      211    AGRL   
    211  EU-2026633   None    34.944840  3.494484e+05      211    AGRL   
    231  EU-2029756   None    57.420560  5.742056e+05      231    PAST   
   243  EU-2034261   None    26.894219  2.689422e+05      243    AGRL   
   243  EU-2034191   None  1413.048177  1.413048e+07      243    AGRL   
    231  EU-2030017   None   785.112680  7.851127e+06      231    PAST   
   243  EU-2034238   None    62.887154  6.288715e+05      243    AGRL   
   312  EU-2041206   None    90.963689  9.096369e+05      312    FRSE   

                                              geometry  CLC_CODE  \
 POLYGON ((664580.232 6456124.995, 664577.116 6...       211   
POLYGON ((661448.216 6462739.348, 661435.432 6...       313   
 POLYGON ((662124.874 6462823.495, 662220.245 6...       211   
 POLYGON ((668192.805 6450263.495, 668179.121 6...       211   
 POLYGON ((662912.623 6448033.995, 662855.868 6...       231   
POLYGON ((663042.054 6453857.496, 662837.556 6...       243   
MULTIPOLYGON (((654082.681 6452253.489, 654077...       243   
 POLYGON ((662568.340 6468060.656, 662619.927 6...       231   
POLYGON ((660278.062 6452665.995, 660167.995 6...       243   
POLYGON ((657220.436 6458048.495, 657254.620 6...       312   

                            LABEL1                            LABEL2  \
            Agricultural areas                       Arable land   
Forest and semi natural areas                           Forests   
            Agricultural areas                       Arable land   
            Agricultural areas                       Arable land   
            Agricultural areas                          Pastures   
           Agricultural areas  Heterogeneous agricultural areas   
           Agricultural areas  Heterogeneous agricultural areas   
            Agricultural areas                          Pastures   
           Agricultural areas  Heterogeneous agricultural areas   
Forest and semi natural areas                           Forests   

                                                LABEL3          RGB  
                         Non-irrigated arable land  255-255-168  
                                     Mixed forest  077-255-000  
                         Non-irrigated arable land  255-255-168  
                         Non-irrigated arable land  255-255-168  
                                          Pastures  230-230-077  
Land principally occupied by agriculture, with...  230-204-077  
Land principally occupied by agriculture, with...  230-204-077  
                                          Pastures  230-230-077  
Land principally occupied by agriculture, with...  230-204-077  
                                Coniferous forest  000-166-000  

# See the column names and confirm that we now have a column called LABEL2, that gives us some textual description for the landuse codes
In [10]: lulc.columns
Out[10]: 
Index(['code_12', 'id', 'remark', 'area_ha', 'shape_area', 'clc_int',
       'Landuse', 'geometry', 'CLC_CODE', 'LABEL1', 'LABEL2', 'LABEL3', 'RGB'],
      dtype='object')

Let’s also get rid of all unnecessary columns by selecting only columns that we need i.e. Landuse, LABEL2 and geometry

# Columns that will be sected
In [11]: selected_cols = ['Landuse', 'LABEL2','geometry']

# Select those columns
In [12]: lulc = lulc[selected_cols]

# Let's see 10 randomly sampled rows
In [13]: lulc.sample(10)
Out[13]: 
    Landuse                                           LABEL2  \
80     AGRL                 Heterogeneous agricultural areas   
241    WETL                                  Inland wetlands   
130    AGRL                 Heterogeneous agricultural areas   
65     PAST                                         Pastures   
21     AGRL                                      Arable land   
49     AGRL                                      Arable land   
55     AGRL                                      Arable land   
4      URML                                     Urban fabric   
161    FRSE                                          Forests   
230    RNGB  Scrub and/or herbaceous vegetation associations   

                                              geometry  
80   POLYGON ((661999.553 6456694.993, 662158.623 6...  
241  POLYGON ((660625.437 6469578.988, 660637.496 6...  
130  POLYGON ((656282.430 6459869.487, 656317.304 6...  
65   POLYGON ((658319.810 6460064.491, 658182.936 6...  
21   POLYGON ((659696.375 6454079.995, 659655.429 6...  
49   POLYGON ((664101.431 6464658.061, 664168.493 6...  
55   MULTIPOLYGON (((663877.511 6466895.588, 663738...  
4    POLYGON ((658348.310 6467489.990, 658040.310 6...  
161  POLYGON ((654532.876 6461046.171, 654635.125 6...  
230  POLYGON ((662872.310 6461090.487, 662990.991 6...  

Now we have cleaned the data and have only those columns that we need for our analysis.

Join the layers¶

Now we are ready to perform the spatial join between the two layers that we have. The aim here is to get information about how many species sightings (of which species) happened in which landuse types? . Thus, we want to join attributes from the landuse layer we just modified into the already used and prepared monitoring GeoPackage file, category_3_species_porijogi.gpkg.

Read the category_3_species_porijogi.gpkg layer into memory

# protected species under class 3 monitoring sightings
In [14]: species_fp = "source/_static/data/L3/category_3_species_porijogi.gpkg"

# Read data
In [15]: species = gpd.read_file(species_fp, layer='category_3_species_porijogi', driver='GPKG')

Let’s make sure that the coordinate reference system of the layers are identical

# Check the crs of landuse
In [16]: lulc.crs
Out[16]: 
<Projected CRS: EPSG:3301>
Name: Estonian Coordinate System of 1997
Axis Info [cartesian]:
- X[north]: Northing (metre)
- Y[east]: Easting (metre)
Area of Use:
- name: Estonia
- bounds: (20.37, 57.52, 28.2, 60.0)
Coordinate Operation:
- name: Estonian National Grid
- method: Lambert Conic Conformal (2SP)
Datum: Estonia 1997
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich

# Check the crs of species layer in case we need to reproject the geometries to make them comparable
In [17]: species.crs
Out[17]: 
<Projected CRS: EPSG:3301>
Name: Estonian Coordinate System of 1997
Axis Info [cartesian]:
- X[north]: Northing (metre)
- Y[east]: Easting (metre)
Area of Use:
- name: Estonia
- bounds: (20.37, 57.52, 28.2, 60.0)
Coordinate Operation:
- name: Estonian National Grid
- method: Lambert Conic Conformal (2SP)
Datum: Estonia 1997
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich

# Do they match? - We can test that
In [18]: lulc.crs == species.crs
Out[18]: True

They are identical. Thus, we can be sure that when doing spatial queries between layers the locations match and we get the right results e.g. from the spatial join that we are conducting here.

Let’s now join the attributes from lulc (2) GeoDataFrame into species GeoDataFrame (1) by using gpd.sjoin() -function

# Make a spatial join
In [19]: join = gpd.sjoin(species, lulc, how="inner", op="within")

# Let's check the result
In [20]: join.head()
Out[20]: 
     OBJECTID        LIIK               NIMI   EXT_SYST_I    KKR_KOOD  \
126    144115  taimed III  läikiv kurdsirbik  -1324418359  KLO9400024   
132    144202  taimed III    siberi võhumõõk   1344032932  KLO9312850   
157    151696  taimed III     balti sõrmkäpp   -376320786  KLO9317690   
201    154300  taimed III  harilik käoraamat   -403691661  KLO9303449   
257    154692  taimed III       suur käopõll   1784194494  KLO9303464   

    PRIV_TYYP       STAATUS  IMPORT  LAADIMISKP  \
126    Avalik  kontrollitud       0  2018-10-29   
132    Avalik  kontrollitud       0  2018-10-29   
157    Avalik  kontrollitud       0  2018-10-29   
201    Avalik  kontrollitud       0  2018-10-29   
257    Avalik  kontrollitud       0  2018-10-29   

                           geometry  index_right Landuse  \
126  POINT (652641.854 6458099.088)          224    RNGB   
132  POINT (660367.911 6469599.134)          143    FRSD   
157  POINT (656609.812 6459582.666)          201    FRST   
201  POINT (657356.897 6459780.112)          201    FRST   
257  POINT (657291.854 6459783.092)          201    FRST   

                                              LABEL2  
126  Scrub and/or herbaceous vegetation associations  
132                                          Forests  
157                                          Forests  
201                                          Forests  
257                                          Forests  

Awesome! Now we have performed a successful spatial join where we got two new columns into our join GeoDataFrame, i.e. index_right that tells the index of the matching polygon in the lulc layer and species.

Let’s save this layer into a new Shapefile

# Output path
outfp = "source/_static/data/L3/landuse_per_species.shp"

# Save to disk
join.to_file(outfp)

Do the results make sense? Let’s evaluate this a bit by grouping and querying the resulting join for largest landuse type and species types combinations:

In [21]: join['NIMI'].value_counts()
Out[21]: 
balti sõrmkäpp            20
valge-toonekurg           20
kahkjaspunane sõrmkäpp    17
suur käopõll              16
aas-karukell               6
vööthuul-sõrmkäpp          5
soo-neiuvaip               5
sulgjas õhik               2
siberi võhumõõk            2
harilik käoraamat          2
läikiv kurdsirbik          1
suur-kuldtiib              1
suur-rabakiil              1
roomav öövilge             1
kuradi-sõrmkäpp            1
hink                       1
Veski                      1
valgelaup-rabakiil         1
ohakasoomukas              1
kodukakk                   1
hall käpp                  1
vingerjas                  1
Name: NIMI, dtype: int64

In [22]: join['LABEL2'].value_counts()
Out[22]: 
Forests                                            52
Heterogeneous agricultural areas                   22
Scrub and/or herbaceous vegetation associations    22
Pastures                                            4
Urban fabric                                        2
Inland waters                                       2
Arable land                                         2
Inland wetlands                                     1
Name: LABEL2, dtype: int64

# initialise empty list
In [23]: data_list = []

In [24]: for species_id, species_group in join.groupby('NIMI'):
   ....:     lulc_count = species_group['LABEL2'].value_counts()
   ....:     top = lulc_count.head(1)
   ....:     data_list.append({'species_id':species_id, 'all_sights': len(species_group), 'top_lulc': top.index[0], 'sights_in_top': top[0]})
   ....: 

# Creates DataFrame from now filled list wit hdata items.
In [25]: top_sights = pd.DataFrame(data_list)

# Print the data
In [26]: top_sights.sort_values(by=['all_sights','sights_in_top'], ascending=False).head(10)
Out[26]: 
                species_id  all_sights  \
18         valge-toonekurg          20   
2           balti sõrmkäpp          20   
6   kahkjaspunane sõrmkäpp          17   
15            suur käopõll          16   
1             aas-karukell           6   
13            soo-neiuvaip           5   
21       vööthuul-sõrmkäpp           5   
4        harilik käoraamat           2   
12         siberi võhumõõk           2   
14            sulgjas õhik           2   

                                           top_lulc  sights_in_top  
18                 Heterogeneous agricultural areas             15  
2                                           Forests             12  
6                                           Forests             12  
15                                          Forests             14  
1                                           Forests              4  
13  Scrub and/or herbaceous vegetation associations              3  
21                                          Forests              3  
4   Scrub and/or herbaceous vegetation associations              1  
12                 Heterogeneous agricultural areas              1  
14  Scrub and/or herbaceous vegetation associations              1  

Launch in the web/MyBinder: