A Spatial join is yet another classic GIS problem. Getting attributes from one layer and transferring them into another layer based on their spatial relationship is something you most likely need to do on a regular basis.
The previous materials focused on learning how to perform a Point in Polygon query. We could now apply those techniques and create our own function to perform a spatial join between two layers based on their spatial relationship. We could for example join the attributes of a polygon layer into a point layer where each point would get the attributes of a polygon that contains the point.
Luckily, spatial joins (gpd.sjoin() -function) is already implemented in Geopandas, thus we do not need to create it ourselves. There are three possible types of join that can be applied in spatial join that are determined with op -parameter:
"intersects"
"within"
"contains"
Sounds familiar? Yep, all of those spatial relationships were discussed in the previous materials, and even Shapely basics thus you should know how they work.
Let’s perform a spatial join between the species monitoring data GeoPackage file, category_3_species_porijogi.gpkg and a Polygon layer that is extracted Corine Landuse Cover for the Porijogi region.
Download and clean the data
For this lesson we will be using publicly available Corine Landuse Cover for the Porijogi region in Tartumaa:
You should now have the files listed above in your Data folder.
Let’s read the data into geopandas and see what we have.
import geopandas as gpd# Filepathfp ="../files/data/L3/porijogi_corine_landuse.shp"# Read the datalulc = gpd.read_file(fp)# See the first rowsdisplay(lulc.head(5))
code_12
id
remark
area_ha
shape_area
clc_int
Landuse
geometry
0
112
EU-2024407
None
67.055321
670553.20630
112
URML
POLYGON ((658854.791 6458244.203, 658826.936 6...
1
112
EU-2024418
None
36.452500
364525.00295
112
URML
POLYGON ((663553.865 6459840.806, 663570.622 6...
2
112
EU-2024426
None
33.525145
335251.45070
112
URML
POLYGON ((659006.349 6463680.667, 659031.241 6...
3
112
EU-2024431
None
30.111925
301119.25420
112
URML
POLYGON ((658401.027 6466951.556, 658518.679 6...
4
112
EU-2024434
None
70.979465
709794.64765
112
URML
POLYGON ((658348.31 6467489.99, 658040.31 6467...
We have multiple columns in the dataset but the most important one here is the column clc_int (corine landuse code) that tells the type of landuse cover under that polygon. Download the codes lookup table:
In order to know what the codes mean, we will merge a lookup table to our spatial landuse cover GeoDataframe.
import pandas as pdcodes = pd.read_csv('../files/data/L3/corine_landuse_codes.csv', sep=';')display(codes.head(5))
CLC_CODE
LABEL1
LABEL2
LABEL3
RGB
0
111
Artificial surfaces
Urban fabric
Continuous urban fabric
230-000-077
1
112
Artificial surfaces
Urban fabric
Discontinuous urban fabric
255-000-000
2
121
Artificial surfaces
Industrial, commercial and transport units
Industrial or commercial units
204-077-242
3
122
Artificial surfaces
Industrial, commercial and transport units
Road and rail networks and associated land
204-000-000
4
123
Artificial surfaces
Industrial, commercial and transport units
Port areas
230-204-204
This table contains a field CLC_CODE which we use to connect the correct mapping to our landuse cover GeoDataframe, which has a field clc_int. We will now merge the lookup table codes (a Pandas dataframe) into our lulc GeoDataframe, based on the identifiers in the mentioned fields:
# general pandas table joinlulc = lulc.merge(codes, left_on='clc_int', right_on='CLC_CODE')display(lulc.sample(5))
code_12
id
remark
area_ha
shape_area
clc_int
Landuse
geometry
CLC_CODE
LABEL1
LABEL2
LABEL3
RGB
188
313
EU-2046278
None
3186.428951
3.186429e+07
313
FRST
POLYGON ((658183.498 6456106.487, 658272.934 6...
313
Forest and semi natural areas
Forests
Mixed forest
077-255-000
231
324
EU-2052212
None
77.935898
7.793590e+05
324
RNGB
POLYGON ((657444.378 6462063.825, 657491.436 6...
324
Forest and semi natural areas
Scrub and/or herbaceous vegetation associations
Transitional woodland-shrub
166-242-000
139
243
EU-2034520
None
77.763526
7.776353e+05
243
AGRL
POLYGON ((663034.378 6468433.165, 663034.37 64...
243
Agricultural areas
Heterogeneous agricultural areas
Land principally occupied by agriculture, with...
230-204-077
87
242
EU-2032505
None
72.501812
7.250181e+05
242
AGRL
POLYGON ((657763.169 6460603.043, 657761.623 6...
242
Agricultural areas
Heterogeneous agricultural areas
Complex cultivation patterns
255-230-077
18
211
EU-2026642
None
50.555051
5.055505e+05
211
AGRL
POLYGON ((657322.374 6452986.996, 657304.188 6...
211
Agricultural areas
Arable land
Non-irrigated arable land
255-255-168
# See the column names and confirm that we now have a column called# LABEL2, that gives us some textual description for the landuse codesdisplay(lulc.columns)
Let’s also get rid of the unnecessary columns by selecting only columns that we need i.e. Landuse, LABEL2 and geometry
# Columns that will be selectedselected_cols = ['Landuse', 'LABEL2','geometry']# Select those columns, we intentional assign to and overwrite the same variablelulc = lulc[selected_cols]# Let's see 5 randomly sampled rowsdisplay(lulc.sample(5))
Landuse
LABEL2
geometry
197
FRST
Forests
POLYGON ((653411.245 6459903.494, 653413.623 6...
143
FRSD
Forests
POLYGON ((660625.437 6469578.988, 660496.869 6...
243
WATR
Inland waters
POLYGON ((651634.186 6453417.991, 651530.499 6...
29
AGRL
Arable land
POLYGON ((665689.415 6456319.842, 665710.118 6...
66
PAST
Pastures
POLYGON ((653454.668 6461601.502, 653436.373 6...
Now we have cleaned the data and have only those columns that we need for our analysis.
Join the layers
Now we are ready to perform the spatial join between the two layers that we have. The aim here is to get information about how many species sightings (of which species) happened in which landuse types? . Thus, we want to join attributes from the landuse layer we just modified into the already used and prepared monitoring GeoPackage file:
Read the category_3_species_porijogi.gpkg layer into memory
# protected species under class 3 monitoring sightingsspecies_fp ="../files/data/L3/category_3_species_porijogi.gpkg"# Read data species = gpd.read_file(species_fp, layer='category_3_species_porijogi', driver='GPKG')
/usr/lib64/python3.13/site-packages/pyogrio/raw.py:198: RuntimeWarning: driver GPKG does not support open option DRIVER
return ogr_read(
Let’s make sure that the coordinate reference system of the layers are identical:
# Check the crs of landuseprint(lulc.crs)# Check the crs of species layer in case we need to reproject the# geometries to make them comparableprint(species.crs)# Do they match? - We can test thatprint(f"Do they match: {lulc.crs == species.crs} ")
EPSG:3301
EPSG:3301
Do they match: True
They are identical. Thus, we can be sure that when doing spatial queries between layers the locations match and we get the right results e.g. from the spatial join that we are conducting here.
Let’s now join the attributes from lulc (2) GeoDataFrame into species GeoDataFrame (1) by using gpd.sjoin() -function
# Make a spatial joinjoined = gpd.sjoin(species, lulc, how="inner", predicate="within")# Let's check the resultdisplay(joined.head(5))
OBJECTID
LIIK
NIMI
EXT_SYST_I
KKR_KOOD
PRIV_TYYP
STAATUS
IMPORT
LAADIMISKP
geometry
index_right
Landuse
LABEL2
126
144115
taimed III
läikiv kurdsirbik
-1324418359
KLO9400024
Avalik
kontrollitud
0
2018-10-29
POINT (652641.854 6458099.088)
224
RNGB
Scrub and/or herbaceous vegetation associations
132
144202
taimed III
siberi võhumõõk
1344032932
KLO9312850
Avalik
kontrollitud
0
2018-10-29
POINT (660367.911 6469599.134)
143
FRSD
Forests
157
151696
taimed III
balti sõrmkäpp
-376320786
KLO9317690
Avalik
kontrollitud
0
2018-10-29
POINT (656609.812 6459582.666)
201
FRST
Forests
169
154103
taimed III
aas-karukell
979547047
KLO9312788
Avalik
kontrollitud
0
2018-10-29
POINT (651410.669 6452555.366)
176
FRST
Forests
175
154114
taimed III
aas-karukell
2091545307
KLO9303437
Avalik
kontrollitud
0
2018-10-29
POINT (651272.143 6452333.26)
108
AGRL
Heterogeneous agricultural areas
Well done. Now we have performed a successful spatial join where we got two new columns into our join GeoDataFrame, i.e. index_right that tells the index of the matching polygon in the lulc layer and species.
We could save this layer into a file and review it in QGIS:
# Output pathoutfp ="../files/data/L3/landuse_per_species.gpkg"# Save to diskjoined.to_file(outfp, driver="GPKG", layer='landuse_per_species')
Do the results make sense? Let’s evaluate this a bit by grouping and querying the resulting join for largest landuse type and species types combinations:
# how many times each species is represented in the datadisplay(joined['NIMI'].value_counts())
NIMI
balti sõrmkäpp 20
valge-toonekurg 20
kahkjaspunane sõrmkäpp 17
suur käopõll 16
aas-karukell 6
soo-neiuvaip 5
vööthuul-sõrmkäpp 5
siberi võhumõõk 2
sulgjas õhik 2
harilik käoraamat 2
läikiv kurdsirbik 1
roomav öövilge 1
suur-kuldtiib 1
hall käpp 1
vingerjas 1
ohakasoomukas 1
kodukakk 1
hink 1
Veski 1
kuradi-sõrmkäpp 1
valgelaup-rabakiil 1
suur-rabakiil 1
Name: count, dtype: int64
# how many times each landuse type is represented in the datadisplay(joined['LABEL2'].value_counts())
Those values by themselves only give a cursorily view of the data. Let’s make some meaningful aggregations.
# initialise empty listdata_list = []for species_id, species_group in joined.groupby('NIMI'): lulc_count = species_group['LABEL2'].value_counts() top = lulc_count.head(1)# add info to list data_row_as_dict = {'species_id':species_id,'all_sights': len(species_group),'top_lulc': top.index[0],'sights_in_top': top[0] } data_list.append(data_row_as_dict)# Creates DataFrame from now filled list with data items.top_sights = pd.DataFrame(data_list)# Print the datatop_sights.sort_values(by=['all_sights','sights_in_top'], ascending=False).head(5)
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_281/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
species_id
all_sights
top_lulc
sights_in_top
18
valge-toonekurg
20
Heterogeneous agricultural areas
15
2
balti sõrmkäpp
20
Forests
12
6
kahkjaspunane sõrmkäpp
17
Forests
12
15
suur käopõll
16
Forests
14
1
aas-karukell
6
Forests
4
The “valge-toonekurg” (white stork) is the most sighted bird species in the data, and 15 out of 20 sights are in the “Heterogeneous agricultural areas” landuse type.