Spatial join

A Spatial join is yet another classic GIS problem. Getting attributes from one layer and transferring them into another layer based on their spatial relationship is something you most likely need to do on a regular basis.

The previous materials focused on learning how to perform a Point in Polygon query. We could now apply those techniques and create our own function to perform a spatial join between two layers based on their spatial relationship. We could for example join the attributes of a polygon layer into a point layer where each point would get the attributes of a polygon that contains the point.

Luckily, spatial joins (gpd.sjoin() -function) is already implemented in Geopandas, thus we do not need to create it ourselves. There are three possible types of join that can be applied in spatial join that are determined with op -parameter:

  • "intersects"

  • "within"

  • "contains"

Sounds familiar? Yep, all of those spatial relationships were discussed in the previous materials, thus you should know how they work.

Let’s perform a spatial join between the species monitoring data GeoPackage file, category_3_species_porijogi.gpkg and a Polygon layer that is extracted Corine Landuse Cover for the Porijogi region.

Download and clean the data

For this lesson we will be using publicly available Corine Landuse Cover for the Porijogi region in Tartumaa.

  • Unzip the file into into your L3 folder

porijogi_corine_landuse.dbf  porijogi_corine_landuse.shp
porijogi_corine_landuse.prj  porijogi_corine_landuse.shx

You should now have the files listed above in your Data folder.

  • Let’s read the data into geopandas and see what we have.

In [1]: import geopandas as gpd

# Filepath
In [2]: fp = "source/_static/data/L3/porijogi_corine_landuse.shp"

# Read the data
In [3]: lulc = gpd.read_file(fp)
# See the first rows
In [4]: lulc.head()
Out[4]: 
  code_12          id remark    area_ha    shape_area  clc_int Landuse  \
0     112  EU-2024407   None  67.055321  670553.20630      112    URML   
1     112  EU-2024418   None  36.452500  364525.00295      112    URML   
2     112  EU-2024426   None  33.525145  335251.45070      112    URML   
3     112  EU-2024431   None  30.111925  301119.25420      112    URML   
4     112  EU-2024434   None  70.979465  709794.64765      112    URML   

                                            geometry  
0  POLYGON ((658854.791 6458244.203, 658826.936 6...  
1  POLYGON ((663553.865 6459840.806, 663570.622 6...  
2  POLYGON ((659006.349 6463680.667, 659031.241 6...  
3  POLYGON ((658401.027 6466951.556, 658518.679 6...  
4  POLYGON ((658348.310 6467489.990, 658040.310 6...  

Okey so we have multiple columns in the dataset but the most important one here is the column clc_int (corine landuse code) that tells the type of landuse cover under that polygon. Download the codes lookup table

In order to know what the codes mean, we will merge a lookup table to our spatial landuse cover GeoDataframe.

In [5]: import pandas as pd

In [6]: codes = pd.read_csv('source/_static/data/L3/corine_landuse_codes.csv', sep=';')

In [7]: codes
Out[7]: 
    CLC_CODE                         LABEL1  \
0        111            Artificial surfaces   
1        112            Artificial surfaces   
2        121            Artificial surfaces   
3        122            Artificial surfaces   
4        123            Artificial surfaces   
5        124            Artificial surfaces   
6        131            Artificial surfaces   
7        132            Artificial surfaces   
8        133            Artificial surfaces   
9        141            Artificial surfaces   
10       142            Artificial surfaces   
11       211             Agricultural areas   
12       222             Agricultural areas   
13       231             Agricultural areas   
14       242             Agricultural areas   
15       243             Agricultural areas   
16       311  Forest and semi natural areas   
17       312  Forest and semi natural areas   
18       313  Forest and semi natural areas   
19       321  Forest and semi natural areas   
20       322  Forest and semi natural areas   
21       324  Forest and semi natural areas   
22       331  Forest and semi natural areas   
23       333  Forest and semi natural areas   
24       411                       Wetlands   
25       412                       Wetlands   
26       421                       Wetlands   
27       511                   Water bodies   
28       512                   Water bodies   
29       521                   Water bodies   
30       523                   Water bodies   

                                             LABEL2  \
0                                      Urban fabric   
1                                      Urban fabric   
2        Industrial, commercial and transport units   
3        Industrial, commercial and transport units   
4        Industrial, commercial and transport units   
5        Industrial, commercial and transport units   
6                 Mine, dump and construction sites   
7                 Mine, dump and construction sites   
8                 Mine, dump and construction sites   
9      Artificial, non-agricultural vegetated areas   
10     Artificial, non-agricultural vegetated areas   
11                                      Arable land   
12                                  Permanent crops   
13                                         Pastures   
14                 Heterogeneous agricultural areas   
15                 Heterogeneous agricultural areas   
16                                          Forests   
17                                          Forests   
18                                          Forests   
19  Scrub and/or herbaceous vegetation associations   
20  Scrub and/or herbaceous vegetation associations   
21  Scrub and/or herbaceous vegetation associations   
22         Open spaces with little or no vegetation   
23         Open spaces with little or no vegetation   
24                                  Inland wetlands   
25                                  Inland wetlands   
26                                Maritime wetlands   
27                                    Inland waters   
28                                    Inland waters   
29                                    Marine waters   
30                                    Marine waters   

                                               LABEL3          RGB  
0                             Continuous urban fabric  230-000-077  
1                          Discontinuous urban fabric  255-000-000  
2                      Industrial or commercial units  204-077-242  
3          Road and rail networks and associated land  204-000-000  
4                                          Port areas  230-204-204  
5                                            Airports  230-204-230  
6                            Mineral extraction sites  166-000-204  
7                                          Dump sites  166-077-000  
8                                  Construction sites  255-077-255  
9                                   Green urban areas  255-166-255  
10                       Sport and leisure facilities  255-230-255  
11                          Non-irrigated arable land  255-255-168  
12                  Fruit trees and berry plantations  242-166-077  
13                                           Pastures  230-230-077  
14                       Complex cultivation patterns  255-230-077  
15  Land principally occupied by agriculture, with...  230-204-077  
16                                Broad-leaved forest  128-255-000  
17                                  Coniferous forest  000-166-000  
18                                       Mixed forest  077-255-000  
19                                 Natural grasslands  204-242-077  
20                                Moors and heathland  166-255-128  
21                        Transitional woodland-shrub  166-242-000  
22                              Beaches, dunes, sands  230-230-230  
23                           Sparsely vegetated areas  204-255-204  
24                                     Inland marshes  166-166-255  
25                                          Peat bogs  077-077-255  
26                                       Salt marshes  204-204-255  
27                                      Water courses  000-204-242  
28                                       Water bodies  128-242-230  
29                                    Coastal lagoons  000-255-166  
30                                      Sea and ocean  230-242-255  

This table contains a field CLC_CODE which we use to connect the correct mapping to our landuse cover GeoDataframe, which has a field clc_int. We will now merge the lookup table codes (a Pandas dataframe) into our lulc GeoDataframe, based on the identifiers in the mentioned fields:

In [8]: lulc = lulc.merge(codes, left_on='clc_int', right_on='CLC_CODE')

In [9]: lulc.sample(10)
Out[9]: 
    code_12          id remark      area_ha    shape_area  clc_int Landuse  \
97      242  EU-2032632   None   320.561233  3.205612e+06      242    AGRL   
242     511  EU-2057279   None   680.904882  6.809048e+06      511    WATR   
176     313  EU-2046188   None  1361.557146  1.361557e+07      313    FRST   
123     243  EU-2034298   None   144.695370  1.446954e+06      243    AGRL   
131     243  EU-2034362   None   137.819209  1.378192e+06      243    AGRL   
204     313  EU-2046436   None   178.733078  1.787331e+06      313    FRST   
34      211  EU-2026736   None    64.684476  6.468448e+05      211    AGRL   
206     321  EU-2050766   None   158.484669  1.584847e+06      321    RNGE   
221     324  EU-2052075   None    27.506101  2.750610e+05      324    RNGB   
87      242  EU-2032505   None    72.501812  7.250181e+05      242    AGRL   

                                              geometry  CLC_CODE  \
97   MULTIPOLYGON (((658432.374 6469848.151, 658267...       242   
242  POLYGON ((662550.899 6469623.451, 662525.245 6...       511   
176  POLYGON ((650371.435 6452851.995, 650381.372 6...       313   
123  POLYGON ((661923.007 6455403.736, 662145.620 6...       243   
131  POLYGON ((659379.808 6459999.493, 659243.178 6...       243   
204  POLYGON ((661448.216 6462739.348, 661435.432 6...       313   
34   POLYGON ((661942.121 6457768.994, 661968.303 6...       211   
206  POLYGON ((662183.682 6460761.495, 662185.931 6...       321   
221  POLYGON ((652025.110 6457934.129, 652349.760 6...       324   
87   POLYGON ((657763.169 6460603.043, 657761.623 6...       242   

                            LABEL1  \
97              Agricultural areas   
242                   Water bodies   
176  Forest and semi natural areas   
123             Agricultural areas   
131             Agricultural areas   
204  Forest and semi natural areas   
34              Agricultural areas   
206  Forest and semi natural areas   
221  Forest and semi natural areas   
87              Agricultural areas   

                                              LABEL2  \
97                  Heterogeneous agricultural areas   
242                                    Inland waters   
176                                          Forests   
123                 Heterogeneous agricultural areas   
131                 Heterogeneous agricultural areas   
204                                          Forests   
34                                       Arable land   
206  Scrub and/or herbaceous vegetation associations   
221  Scrub and/or herbaceous vegetation associations   
87                  Heterogeneous agricultural areas   

                                                LABEL3          RGB  
97                        Complex cultivation patterns  255-230-077  
242                                      Water courses  000-204-242  
176                                       Mixed forest  077-255-000  
123  Land principally occupied by agriculture, with...  230-204-077  
131  Land principally occupied by agriculture, with...  230-204-077  
204                                       Mixed forest  077-255-000  
34                           Non-irrigated arable land  255-255-168  
206                                 Natural grasslands  204-242-077  
221                        Transitional woodland-shrub  166-242-000  
87                        Complex cultivation patterns  255-230-077  
# See the column names and confirm that we now have a column called LABEL2, that gives us some textual description for the landuse codes
In [10]: lulc.columns
Out[10]: 
Index(['code_12', 'id', 'remark', 'area_ha', 'shape_area', 'clc_int',
       'Landuse', 'geometry', 'CLC_CODE', 'LABEL1', 'LABEL2', 'LABEL3', 'RGB'],
      dtype='object')
  • Let’s also get rid of all unnecessary columns by selecting only columns that we need i.e. Landuse, LABEL2 and geometry

# Columns that will be sected
In [11]: selected_cols = ['Landuse', 'LABEL2','geometry']

# Select those columns
In [12]: lulc = lulc[selected_cols]

# Let's see 10 randomly sampled rows
In [13]: lulc.sample(10)
Out[13]: 
    Landuse                                           LABEL2  \
210    RNGB  Scrub and/or herbaceous vegetation associations   
176    FRST                                          Forests   
71     AGRL                 Heterogeneous agricultural areas   
127    AGRL                 Heterogeneous agricultural areas   
213    RNGB  Scrub and/or herbaceous vegetation associations   
120    AGRL                 Heterogeneous agricultural areas   
156    FRSE                                          Forests   
175    FRST                                          Forests   
140    FRSD                                          Forests   
155    FRSE                                          Forests   

                                              geometry  
210  POLYGON ((653587.374 6450466.487, 653427.242 6...  
176  POLYGON ((650371.435 6452851.995, 650381.372 6...  
71   POLYGON ((659371.432 6449378.989, 659096.054 6...  
127  POLYGON ((661595.757 6457878.711, 661567.866 6...  
213  POLYGON ((664367.934 6450738.987, 664230.617 6...  
120  POLYGON ((655066.992 6455944.494, 655058.496 6...  
156  POLYGON ((666014.039 6456928.579, 666009.484 6...  
175  POLYGON ((659371.432 6449378.989, 659362.117 6...  
140  POLYGON ((664529.993 6449567.988, 664562.435 6...  
155  POLYGON ((663003.119 6457774.497, 663041.741 6...  

Now we have cleaned the data and have only those columns that we need for our analysis.

Join the layers

Now we are ready to perform the spatial join between the two layers that we have. The aim here is to get information about how many species sightings (of which species) happened in which landuse types? . Thus, we want to join attributes from the landuse layer we just modified into the already used and prepared monitoring GeoPackage file, category_3_species_porijogi.gpkg.

  • Read the category_3_species_porijogi.gpkg layer into memory

# protected species under class 3 monitoring sightings
In [14]: species_fp = "source/_static/data/L3/category_3_species_porijogi.gpkg"

# Read data
In [15]: species = gpd.read_file(species_fp, layer='category_3_species_porijogi', driver='GPKG')
  • Let’s make sure that the coordinate reference system of the layers are identical

# Check the crs of landuse
In [16]: lulc.crs
Out[16]: {'init': 'epsg:3301'}

# Check the crs of species layer in case we need to reproject the geometries to make them comparable
In [17]: species.crs
Out[17]: {'init': 'epsg:3301'}

# Do they match? - We can test that
In [18]: lulc.crs == species.crs
Out[18]: True

They are identical. Thus, we can be sure that when doing spatial queries between layers the locations match and we get the right results e.g. from the spatial join that we are conducting here.

  • Let’s now join the attributes from lulc (2) GeoDataFrame into species GeoDataFrame (1) by using gpd.sjoin() -function

# Make a spatial join
In [19]: join = gpd.sjoin(species, lulc, how="inner", op="within")

# Let's check the result
In [20]: join.head()
Out[20]: 
     OBJECTID        LIIK               NIMI   EXT_SYST_I    KKR_KOOD  \
126    144115  taimed III  läikiv kurdsirbik  -1324418359  KLO9400024   
132    144202  taimed III    siberi võhumõõk   1344032932  KLO9312850   
157    151696  taimed III     balti sõrmkäpp   -376320786  KLO9317690   
201    154300  taimed III  harilik käoraamat   -403691661  KLO9303449   
257    154692  taimed III       suur käopõll   1784194494  KLO9303464   

    PRIV_TYYP       STAATUS  IMPORT  LAADIMISKP  \
126    Avalik  kontrollitud       0  2018-10-29   
132    Avalik  kontrollitud       0  2018-10-29   
157    Avalik  kontrollitud       0  2018-10-29   
201    Avalik  kontrollitud       0  2018-10-29   
257    Avalik  kontrollitud       0  2018-10-29   

                           geometry  index_right Landuse  \
126  POINT (652641.854 6458099.088)          224    RNGB   
132  POINT (660367.911 6469599.134)          143    FRSD   
157  POINT (656609.812 6459582.666)          201    FRST   
201  POINT (657356.897 6459780.112)          201    FRST   
257  POINT (657291.854 6459783.092)          201    FRST   

                                              LABEL2  
126  Scrub and/or herbaceous vegetation associations  
132                                          Forests  
157                                          Forests  
201                                          Forests  
257                                          Forests  

Awesome! Now we have performed a successful spatial join where we got two new columns into our join GeoDataFrame, i.e. index_right that tells the index of the matching polygon in the lulc layer and species.

  • Let’s save this layer into a new Shapefile

# Output path
outfp = "source/_static/data/L3/landuse_per_species.shp"

# Save to disk
join.to_file(outfp)

Do the results make sense? Let’s evaluate this a bit by grouping and querying the resulting join for largest landuse type and species types combinations:

In [21]: join['NIMI'].value_counts()
Out[21]: 
balti sõrmkäpp            20
valge-toonekurg           20
kahkjaspunane sõrmkäpp    17
suur käopõll              16
aas-karukell               6
soo-neiuvaip               5
vööthuul-sõrmkäpp          5
siberi võhumõõk            2
sulgjas õhik               2
harilik käoraamat          2
suur-rabakiil              1
kodukakk                   1
vingerjas                  1
valgelaup-rabakiil         1
hink                       1
Veski                      1
läikiv kurdsirbik          1
hall käpp                  1
kuradi-sõrmkäpp            1
ohakasoomukas              1
roomav öövilge             1
suur-kuldtiib              1
Name: NIMI, dtype: int64
In [22]: join['LABEL2'].value_counts()
Out[22]: 
Forests                                            52
Heterogeneous agricultural areas                   22
Scrub and/or herbaceous vegetation associations    22
Pastures                                            4
Inland waters                                       2
Urban fabric                                        2
Arable land                                         2
Inland wetlands                                     1
Name: LABEL2, dtype: int64
# initialise empty list
In [23]: data_list = []

In [24]: for species_id, species_group in join.groupby('NIMI'):
   ....:     lulc_count = species_group['LABEL2'].value_counts()
   ....:     top = lulc_count.head(1)
   ....:     data_list.append({'species_id':species_id, 'all_sights': len(species_group), 'top_lulc': top.index[0], 'sights_in_top': top[0]})
   ....: 

# Creates DataFrame from now filled list wit hdata items.
In [25]: top_sights = pd.DataFrame(data_list)

# Print the data
In [26]: top_sights.sort_values(by=['all_sights','sights_in_top'], ascending=False).head(10)
Out[26]: 
                species_id  all_sights  \
18         valge-toonekurg          20   
2           balti sõrmkäpp          20   
6   kahkjaspunane sõrmkäpp          17   
15            suur käopõll          16   
1             aas-karukell           6   
13            soo-neiuvaip           5   
21       vööthuul-sõrmkäpp           5   
4        harilik käoraamat           2   
12         siberi võhumõõk           2   
14            sulgjas õhik           2   

                                           top_lulc  sights_in_top  
18                 Heterogeneous agricultural areas             15  
2                                           Forests             12  
6                                           Forests             12  
15                                          Forests             14  
1                                           Forests              4  
13  Scrub and/or herbaceous vegetation associations              3  
21                                          Forests              3  
4   Scrub and/or herbaceous vegetation associations              1  
12                 Heterogeneous agricultural areas              1  
14  Scrub and/or herbaceous vegetation associations              1  

Launch in the web/MyBinder:

https://mybinder.org/badge_logo.svg