Spatial join

A Spatial join is yet another classic GIS problem. Getting attributes from one layer and transferring them into another layer based on their spatial relationship is something you most likely need to do on a regular basis.

The previous materials focused on learning how to perform a Point in Polygon query. We could now apply those techniques and create our own function to perform a spatial join between two layers based on their spatial relationship. We could for example join the attributes of a polygon layer into a point layer where each point would get the attributes of a polygon that contains the point.

Luckily, spatial joins (gpd.sjoin() -function) is already implemented in Geopandas, thus we do not need to create it ourselves. There are three possible types of join that can be applied in spatial join that are determined with op -parameter:

Sounds familiar? Yep, all of those spatial relationships were discussed in the previous materials, and even Shapely basics thus you should know how they work.

Let’s perform a spatial join between the species monitoring data GeoPackage file, category_3_species_porijogi.gpkg and a Polygon layer that is extracted Corine Landuse Cover for the Porijogi region.

Download and clean the data

For this lesson we will be using publicly available Corine Landuse Cover for the Porijogi region in Tartumaa:

porijogi_corine_landuse.dbf  porijogi_corine_landuse.shp
porijogi_corine_landuse.prj  porijogi_corine_landuse.shx

You should now have the files listed above in your Data folder.

Let’s read the data into geopandas and see what we have.

import geopandas as gpd

# Filepath
fp = "../files/data/L3/porijogi_corine_landuse.shp"

# Read the data
lulc = gpd.read_file(fp)

# See the first rows
display(lulc.head(5))
code_12 id remark area_ha shape_area clc_int Landuse geometry
0 112 EU-2024407 None 67.055321 670553.20630 112 URML POLYGON ((658854.791 6458244.203, 658826.936 6...
1 112 EU-2024418 None 36.452500 364525.00295 112 URML POLYGON ((663553.865 6459840.806, 663570.622 6...
2 112 EU-2024426 None 33.525145 335251.45070 112 URML POLYGON ((659006.349 6463680.667, 659031.241 6...
3 112 EU-2024431 None 30.111925 301119.25420 112 URML POLYGON ((658401.027 6466951.556, 658518.679 6...
4 112 EU-2024434 None 70.979465 709794.64765 112 URML POLYGON ((658348.31 6467489.99, 658040.31 6467...

We have multiple columns in the dataset but the most important one here is the column clc_int (corine landuse code) that tells the type of landuse cover under that polygon. Download the codes lookup table:

In order to know what the codes mean, we will merge a lookup table to our spatial landuse cover GeoDataframe.

import pandas as pd

codes = pd.read_csv('../files/data/L3/corine_landuse_codes.csv', sep=';')
display(codes.head(5))
CLC_CODE LABEL1 LABEL2 LABEL3 RGB
0 111 Artificial surfaces Urban fabric Continuous urban fabric 230-000-077
1 112 Artificial surfaces Urban fabric Discontinuous urban fabric 255-000-000
2 121 Artificial surfaces Industrial, commercial and transport units Industrial or commercial units 204-077-242
3 122 Artificial surfaces Industrial, commercial and transport units Road and rail networks and associated land 204-000-000
4 123 Artificial surfaces Industrial, commercial and transport units Port areas 230-204-204

This table contains a field CLC_CODE which we use to connect the correct mapping to our landuse cover GeoDataframe, which has a field clc_int. We will now merge the lookup table codes (a Pandas dataframe) into our lulc GeoDataframe, based on the identifiers in the mentioned fields:

# general pandas table join
lulc = lulc.merge(codes, left_on='clc_int', right_on='CLC_CODE')
display(lulc.sample(5))
code_12 id remark area_ha shape_area clc_int Landuse geometry CLC_CODE LABEL1 LABEL2 LABEL3 RGB
188 313 EU-2046278 None 3186.428951 3.186429e+07 313 FRST POLYGON ((658183.498 6456106.487, 658272.934 6... 313 Forest and semi natural areas Forests Mixed forest 077-255-000
231 324 EU-2052212 None 77.935898 7.793590e+05 324 RNGB POLYGON ((657444.378 6462063.825, 657491.436 6... 324 Forest and semi natural areas Scrub and/or herbaceous vegetation associations Transitional woodland-shrub 166-242-000
139 243 EU-2034520 None 77.763526 7.776353e+05 243 AGRL POLYGON ((663034.378 6468433.165, 663034.37 64... 243 Agricultural areas Heterogeneous agricultural areas Land principally occupied by agriculture, with... 230-204-077
87 242 EU-2032505 None 72.501812 7.250181e+05 242 AGRL POLYGON ((657763.169 6460603.043, 657761.623 6... 242 Agricultural areas Heterogeneous agricultural areas Complex cultivation patterns 255-230-077
18 211 EU-2026642 None 50.555051 5.055505e+05 211 AGRL POLYGON ((657322.374 6452986.996, 657304.188 6... 211 Agricultural areas Arable land Non-irrigated arable land 255-255-168
# See the column names and confirm that we now have a column called
# LABEL2, that gives us some textual description for the landuse codes
display(lulc.columns)
Index(['code_12', 'id', 'remark', 'area_ha', 'shape_area', 'clc_int',
       'Landuse', 'geometry', 'CLC_CODE', 'LABEL1', 'LABEL2', 'LABEL3', 'RGB'],
      dtype='object')

Let’s also get rid of the unnecessary columns by selecting only columns that we need i.e. Landuse, LABEL2 and geometry

# Columns that will be selected
selected_cols = ['Landuse', 'LABEL2','geometry']

# Select those columns, we intentional assign to and overwrite the same variable
lulc = lulc[selected_cols]

# Let's see 5 randomly sampled rows
display(lulc.sample(5))
Landuse LABEL2 geometry
197 FRST Forests POLYGON ((653411.245 6459903.494, 653413.623 6...
143 FRSD Forests POLYGON ((660625.437 6469578.988, 660496.869 6...
243 WATR Inland waters POLYGON ((651634.186 6453417.991, 651530.499 6...
29 AGRL Arable land POLYGON ((665689.415 6456319.842, 665710.118 6...
66 PAST Pastures POLYGON ((653454.668 6461601.502, 653436.373 6...

Now we have cleaned the data and have only those columns that we need for our analysis.

Join the layers

Now we are ready to perform the spatial join between the two layers that we have. The aim here is to get information about how many species sightings (of which species) happened in which landuse types? . Thus, we want to join attributes from the landuse layer we just modified into the already used and prepared monitoring GeoPackage file:

Read the category_3_species_porijogi.gpkg layer into memory

# protected species under class 3 monitoring sightings
species_fp = "../files/data/L3/category_3_species_porijogi.gpkg"

# Read data 
species = gpd.read_file(species_fp, layer='category_3_species_porijogi', driver='GPKG')
/usr/lib64/python3.13/site-packages/pyogrio/raw.py:198: RuntimeWarning: driver GPKG does not support open option DRIVER
  return ogr_read(

Let’s make sure that the coordinate reference system of the layers are identical:

# Check the crs of landuse
print(lulc.crs)

# Check the crs of species layer in case we need to reproject the
# geometries to make them comparable
print(species.crs)

# Do they match? - We can test that
print(f"Do they match: {lulc.crs == species.crs} ")
EPSG:3301
EPSG:3301
Do they match: True 

They are identical. Thus, we can be sure that when doing spatial queries between layers the locations match and we get the right results e.g. from the spatial join that we are conducting here.

Let’s now join the attributes from lulc (2) GeoDataFrame into species GeoDataFrame (1) by using gpd.sjoin() -function

# Make a spatial join
joined = gpd.sjoin(species, lulc, how="inner", predicate="within")

# Let's check the result
display(joined.head(5))
OBJECTID LIIK NIMI EXT_SYST_I KKR_KOOD PRIV_TYYP STAATUS IMPORT LAADIMISKP geometry index_right Landuse LABEL2
126 144115 taimed III läikiv kurdsirbik -1324418359 KLO9400024 Avalik kontrollitud 0 2018-10-29 POINT (652641.854 6458099.088) 224 RNGB Scrub and/or herbaceous vegetation associations
132 144202 taimed III siberi võhumõõk 1344032932 KLO9312850 Avalik kontrollitud 0 2018-10-29 POINT (660367.911 6469599.134) 143 FRSD Forests
157 151696 taimed III balti sõrmkäpp -376320786 KLO9317690 Avalik kontrollitud 0 2018-10-29 POINT (656609.812 6459582.666) 201 FRST Forests
169 154103 taimed III aas-karukell 979547047 KLO9312788 Avalik kontrollitud 0 2018-10-29 POINT (651410.669 6452555.366) 176 FRST Forests
175 154114 taimed III aas-karukell 2091545307 KLO9303437 Avalik kontrollitud 0 2018-10-29 POINT (651272.143 6452333.26) 108 AGRL Heterogeneous agricultural areas

Well done. Now we have performed a successful spatial join where we got two new columns into our join GeoDataFrame, i.e. index_right that tells the index of the matching polygon in the lulc layer and species.

We could save this layer into a file and review it in QGIS:

# Output path
outfp = "../files/data/L3/landuse_per_species.gpkg"

# Save to disk
joined.to_file(outfp, driver="GPKG", layer='landuse_per_species')

Do the results make sense? Let’s evaluate this a bit by grouping and querying the resulting join for largest landuse type and species types combinations:

# how many times each species is represented in the data
display(joined['NIMI'].value_counts())
NIMI
balti sõrmkäpp            20
valge-toonekurg           20
kahkjaspunane sõrmkäpp    17
suur käopõll              16
aas-karukell               6
soo-neiuvaip               5
vööthuul-sõrmkäpp          5
siberi võhumõõk            2
sulgjas õhik               2
harilik käoraamat          2
läikiv kurdsirbik          1
roomav öövilge             1
suur-kuldtiib              1
hall käpp                  1
vingerjas                  1
ohakasoomukas              1
kodukakk                   1
hink                       1
Veski                      1
kuradi-sõrmkäpp            1
valgelaup-rabakiil         1
suur-rabakiil              1
Name: count, dtype: int64
# how many times each landuse type is represented in the data
display(joined['LABEL2'].value_counts())
LABEL2
Forests                                            52
Scrub and/or herbaceous vegetation associations    22
Heterogeneous agricultural areas                   22
Pastures                                            4
Inland waters                                       2
Urban fabric                                        2
Arable land                                         2
Inland wetlands                                     1
Name: count, dtype: int64

Those values by themselves only give a cursorily view of the data. Let’s make some meaningful aggregations.

# initialise empty list
data_list = []

for species_id, species_group in joined.groupby('NIMI'):
    lulc_count = species_group['LABEL2'].value_counts()
    top = lulc_count.head(1)
    # add info to list
    data_row_as_dict = {'species_id':species_id,
                        'all_sights': len(species_group),
                        'top_lulc': top.index[0],
                        'sights_in_top': top[0] }
    data_list.append(data_row_as_dict)

# Creates DataFrame from now filled list with data items.
top_sights = pd.DataFrame(data_list)

# Print the data
top_sights.sort_values(by=['all_sights','sights_in_top'], ascending=False).head(5)
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
species_id all_sights top_lulc sights_in_top
18 valge-toonekurg 20 Heterogeneous agricultural areas 15
2 balti sõrmkäpp 20 Forests 12
6 kahkjaspunane sõrmkäpp 17 Forests 12
15 suur käopõll 16 Forests 14
1 aas-karukell 6 Forests 4

The “valge-toonekurg” (white stork) is the most sighted bird species in the data, and 15 out of 20 sights are in the “Heterogeneous agricultural areas” landuse type.

Download the notebook:

file:spatial-join.ipynb

Launch in the web/MyBinder:

image