Spatial join

A Spatial join is yet another classic GIS problem. Getting attributes from one layer and transferring them into another layer based on their spatial relationship is something you most likely need to do on a regular basis.

The previous materials focused on learning how to perform a Point in Polygon query. We could now apply those techniques and create our own function to perform a spatial join between two layers based on their spatial relationship. We could for example join the attributes of a polygon layer into a point layer where each point would get the attributes of a polygon that contains the point.

Luckily, spatial joins (gpd.sjoin() -function) is already implemented in Geopandas, thus we do not need to create it ourselves. There are three possible types of join that can be applied in spatial join that are determined with op -parameter:

"intersects"
"within"
"contains"

Sounds familiar? Yep, all of those spatial relationships were discussed in the previous materials, and even Shapely basics thus you should know how they work.

Let’s perform a spatial join between the species monitoring data GeoPackage file, category_3_species_porijogi.gpkg and a Polygon layer that is extracted Corine Landuse Cover for the Porijogi region.

Download and clean the data

For this lesson we will be using publicly available Corine Landuse Cover for the Porijogi region in Tartumaa:

Download file:porijogi_corine_landuse.zip
Unzip the file into into your L3 folder:

porijogi_corine_landuse.dbf  porijogi_corine_landuse.shp
porijogi_corine_landuse.prj  porijogi_corine_landuse.shx

You should now have the files listed above in your Data folder.

Let’s read the data into geopandas and see what we have.

import geopandas as gpd

# Filepath
fp = "../files/data/L3/porijogi_corine_landuse.shp"

# Read the data
lulc = gpd.read_file(fp)

# See the first rows
display(lulc.head(5))

	code_12	id	remark	area_ha	shape_area	clc_int	Landuse	geometry
0	112	EU-2024407	None	67.055321	670553.20630	112	URML	POLYGON ((658854.791 6458244.203, 658826.936 6...
1	112	EU-2024418	None	36.452500	364525.00295	112	URML	POLYGON ((663553.865 6459840.806, 663570.622 6...
2	112	EU-2024426	None	33.525145	335251.45070	112	URML	POLYGON ((659006.349 6463680.667, 659031.241 6...
3	112	EU-2024431	None	30.111925	301119.25420	112	URML	POLYGON ((658401.027 6466951.556, 658518.679 6...
4	112	EU-2024434	None	70.979465	709794.64765	112	URML	POLYGON ((658348.31 6467489.99, 658040.31 6467...

We have multiple columns in the dataset but the most important one here is the column clc_int (corine landuse code) that tells the type of landuse cover under that polygon. Download the codes lookup table:

file:corine_landuse_codes.csv

In order to know what the codes mean, we will merge a lookup table to our spatial landuse cover GeoDataframe.

import pandas as pd

codes = pd.read_csv('../files/data/L3/corine_landuse_codes.csv', sep=';')
display(codes.head(5))

	CLC_CODE	LABEL1	LABEL2	LABEL3	RGB
0	111	Artificial surfaces	Urban fabric	Continuous urban fabric	230-000-077
1	112	Artificial surfaces	Urban fabric	Discontinuous urban fabric	255-000-000
2	121	Artificial surfaces	Industrial, commercial and transport units	Industrial or commercial units	204-077-242
3	122	Artificial surfaces	Industrial, commercial and transport units	Road and rail networks and associated land	204-000-000
4	123	Artificial surfaces	Industrial, commercial and transport units	Port areas	230-204-204

This table contains a field CLC_CODE which we use to connect the correct mapping to our landuse cover GeoDataframe, which has a field clc_int. We will now merge the lookup table codes (a Pandas dataframe) into our lulc GeoDataframe, based on the identifiers in the mentioned fields:

# general pandas table join
lulc = lulc.merge(codes, left_on='clc_int', right_on='CLC_CODE')
display(lulc.sample(5))

	code_12	id	remark	area_ha	shape_area	clc_int	Landuse	geometry	CLC_CODE	LABEL1	LABEL2	LABEL3	RGB
188	313	EU-2046278	None	3186.428951	3.186429e+07	313	FRST	POLYGON ((658183.498 6456106.487, 658272.934 6...	313	Forest and semi natural areas	Forests	Mixed forest	077-255-000
231	324	EU-2052212	None	77.935898	7.793590e+05	324	RNGB	POLYGON ((657444.378 6462063.825, 657491.436 6...	324	Forest and semi natural areas	Scrub and/or herbaceous vegetation associations	Transitional woodland-shrub	166-242-000
139	243	EU-2034520	None	77.763526	7.776353e+05	243	AGRL	POLYGON ((663034.378 6468433.165, 663034.37 64...	243	Agricultural areas	Heterogeneous agricultural areas	Land principally occupied by agriculture, with...	230-204-077
87	242	EU-2032505	None	72.501812	7.250181e+05	242	AGRL	POLYGON ((657763.169 6460603.043, 657761.623 6...	242	Agricultural areas	Heterogeneous agricultural areas	Complex cultivation patterns	255-230-077
18	211	EU-2026642	None	50.555051	5.055505e+05	211	AGRL	POLYGON ((657322.374 6452986.996, 657304.188 6...	211	Agricultural areas	Arable land	Non-irrigated arable land	255-255-168

# See the column names and confirm that we now have a column called
# LABEL2, that gives us some textual description for the landuse codes
display(lulc.columns)

Index(['code_12', 'id', 'remark', 'area_ha', 'shape_area', 'clc_int',
       'Landuse', 'geometry', 'CLC_CODE', 'LABEL1', 'LABEL2', 'LABEL3', 'RGB'],
      dtype='object')

Let’s also get rid of the unnecessary columns by selecting only columns that we need i.e. Landuse, LABEL2 and geometry

# Columns that will be selected
selected_cols = ['Landuse', 'LABEL2','geometry']

# Select those columns, we intentional assign to and overwrite the same variable
lulc = lulc[selected_cols]

# Let's see 5 randomly sampled rows
display(lulc.sample(5))

	Landuse	LABEL2	geometry
197	FRST	Forests	POLYGON ((653411.245 6459903.494, 653413.623 6...
143	FRSD	Forests	POLYGON ((660625.437 6469578.988, 660496.869 6...
243	WATR	Inland waters	POLYGON ((651634.186 6453417.991, 651530.499 6...
29	AGRL	Arable land	POLYGON ((665689.415 6456319.842, 665710.118 6...
66	PAST	Pastures	POLYGON ((653454.668 6461601.502, 653436.373 6...

Now we have cleaned the data and have only those columns that we need for our analysis.

Join the layers

Now we are ready to perform the spatial join between the two layers that we have. The aim here is to get information about how many species sightings (of which species) happened in which landuse types? . Thus, we want to join attributes from the landuse layer we just modified into the already used and prepared monitoring GeoPackage file:

category_3_species_porijogi.gpkg

Read the category_3_species_porijogi.gpkg layer into memory

# protected species under class 3 monitoring sightings
species_fp = "../files/data/L3/category_3_species_porijogi.gpkg"

# Read data 
species = gpd.read_file(species_fp, layer='category_3_species_porijogi', driver='GPKG')

/usr/lib64/python3.13/site-packages/pyogrio/raw.py:198: RuntimeWarning: driver GPKG does not support open option DRIVER
  return ogr_read(

Let’s make sure that the coordinate reference system of the layers are identical:

# Check the crs of landuse
print(lulc.crs)

# Check the crs of species layer in case we need to reproject the
# geometries to make them comparable
print(species.crs)

# Do they match? - We can test that
print(f"Do they match: {lulc.crs == species.crs} ")

EPSG:3301
EPSG:3301
Do they match: True

They are identical. Thus, we can be sure that when doing spatial queries between layers the locations match and we get the right results e.g. from the spatial join that we are conducting here.

Let’s now join the attributes from lulc (2) GeoDataFrame into species GeoDataFrame (1) by using gpd.sjoin() -function

# Make a spatial join
joined = gpd.sjoin(species, lulc, how="inner", predicate="within")

# Let's check the result
display(joined.head(5))

	OBJECTID	LIIK	NIMI	EXT_SYST_I	KKR_KOOD	PRIV_TYYP	STAATUS	LAADIMISKP	geometry	index_right	Landuse	LABEL2
126	144115	taimed III	läikiv kurdsirbik	-1324418359	KLO9400024	Avalik	kontrollitud	2018-10-29	POINT (652641.854 6458099.088)	224	RNGB	Scrub and/or herbaceous vegetation associations
132	144202	taimed III	siberi võhumõõk	1344032932	KLO9312850	Avalik	kontrollitud	2018-10-29	POINT (660367.911 6469599.134)	143	FRSD	Forests
157	151696	taimed III	balti sõrmkäpp	-376320786	KLO9317690	Avalik	kontrollitud	2018-10-29	POINT (656609.812 6459582.666)	201	FRST	Forests
169	154103	taimed III	aas-karukell	979547047	KLO9312788	Avalik	kontrollitud	2018-10-29	POINT (651410.669 6452555.366)	176	FRST	Forests
175	154114	taimed III	aas-karukell	2091545307	KLO9303437	Avalik	kontrollitud	2018-10-29	POINT (651272.143 6452333.26)	108	AGRL	Heterogeneous agricultural areas

Well done. Now we have performed a successful spatial join where we got two new columns into our join GeoDataFrame, i.e. index_right that tells the index of the matching polygon in the lulc layer and species.

We could save this layer into a file and review it in QGIS:

# Output path
outfp = "../files/data/L3/landuse_per_species.gpkg"

# Save to disk
joined.to_file(outfp, driver="GPKG", layer='landuse_per_species')

Do the results make sense? Let’s evaluate this a bit by grouping and querying the resulting join for largest landuse type and species types combinations:

# how many times each species is represented in the data
display(joined['NIMI'].value_counts())

NIMI
balti sõrmkäpp            20
valge-toonekurg           20
kahkjaspunane sõrmkäpp    17
suur käopõll              16
aas-karukell               6
soo-neiuvaip               5
vööthuul-sõrmkäpp          5
siberi võhumõõk            2
sulgjas õhik               2
harilik käoraamat          2
läikiv kurdsirbik          1
roomav öövilge             1
suur-kuldtiib              1
hall käpp                  1
vingerjas                  1
ohakasoomukas              1
kodukakk                   1
hink                       1
Veski                      1
kuradi-sõrmkäpp            1
valgelaup-rabakiil         1
suur-rabakiil              1
Name: count, dtype: int64

# how many times each landuse type is represented in the data
display(joined['LABEL2'].value_counts())

LABEL2
Forests                                            52
Scrub and/or herbaceous vegetation associations    22
Heterogeneous agricultural areas                   22
Pastures                                            4
Inland waters                                       2
Urban fabric                                        2
Arable land                                         2
Inland wetlands                                     1
Name: count, dtype: int64

Those values by themselves only give a cursorily view of the data. Let’s make some meaningful aggregations.

# initialise empty list
data_list = []

for species_id, species_group in joined.groupby('NIMI'):
    lulc_count = species_group['LABEL2'].value_counts()
    top = lulc_count.head(1)
    # add info to list
    data_row_as_dict = {'species_id':species_id,
                        'all_sights': len(species_group),
                        'top_lulc': top.index[0],
                        'sights_in_top': top[0] }
    data_list.append(data_row_as_dict)

# Creates DataFrame from now filled list with data items.
top_sights = pd.DataFrame(data_list)

# Print the data
top_sights.sort_values(by=['all_sights','sights_in_top'], ascending=False).head(5)

/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }

	species_id	all_sights	top_lulc	sights_in_top
18	valge-toonekurg	20	Heterogeneous agricultural areas	15
2	balti sõrmkäpp	20	Forests	12
6	kahkjaspunane sõrmkäpp	17	Forests	12
15	suur käopõll	16	Forests	14
1	aas-karukell	6	Forests	4

The “valge-toonekurg” (white stork) is the most sighted bird species in the data, and 15 out of 20 sights are in the “Heterogeneous agricultural areas” landuse type.

Download the notebook:

file:spatial-join.ipynb

Launch in the web/MyBinder: