A Spatial join is yet another classic GIS problem. Getting attributes from one layer and transferring them into another layer based on their spatial relationship is something you most likely need to do on a regular basis.
The previous materials focused on learning how to perform a Point in Polygon query. We could now apply those techniques and create our own function to perform a spatial join between two layers based on their spatial relationship. We could for example join the attributes of a polygon layer into a point layer where each point would get the attributes of a polygon that contains the point.
Luckily, spatial joins (gpd.sjoin() -function) is already implemented in Geopandas, thus we do not need to create it ourselves. There are three possible types of join that can be applied in spatial join that are determined with op -parameter:
"intersects"
"within"
"contains"
Sounds familiar? Yep, all of those spatial relationships were discussed in the previous materials, and even Shapely basics thus you should know how they work.
Let’s perform a spatial join between the species monitoring data GeoPackage file, category_3_species_porijogi.gpkg and a Polygon layer that is extracted Corine Landuse Cover for the Porijogi region.
Download and clean the data
For this lesson we will be using publicly available Corine Landuse Cover for the Porijogi region in Tartumaa:
You should now have the files listed above in your Data folder.
Let’s read the data into geopandas and see what we have.
import geopandas as gpd# Filepathfp ="../files/data/L3/porijogi_corine_landuse.shp"# Read the datalulc = gpd.read_file(fp)# See the first rowsdisplay(lulc.head(5))
code_12
id
remark
area_ha
shape_area
clc_int
Landuse
geometry
0
112
EU-2024407
None
67.055321
670553.20630
112
URML
POLYGON ((658854.791 6458244.203, 658826.936 6...
1
112
EU-2024418
None
36.452500
364525.00295
112
URML
POLYGON ((663553.865 6459840.806, 663570.622 6...
2
112
EU-2024426
None
33.525145
335251.45070
112
URML
POLYGON ((659006.349 6463680.667, 659031.241 6...
3
112
EU-2024431
None
30.111925
301119.25420
112
URML
POLYGON ((658401.027 6466951.556, 658518.679 6...
4
112
EU-2024434
None
70.979465
709794.64765
112
URML
POLYGON ((658348.310 6467489.990, 658040.310 6...
We have multiple columns in the dataset but the most important one here is the column clc_int (corine landuse code) that tells the type of landuse cover under that polygon. Download the codes lookup table:
In order to know what the codes mean, we will merge a lookup table to our spatial landuse cover GeoDataframe.
import pandas as pdcodes = pd.read_csv('../files/data/L3/corine_landuse_codes.csv', sep=';')display(codes.head(5))
CLC_CODE
LABEL1
LABEL2
LABEL3
RGB
0
111
Artificial surfaces
Urban fabric
Continuous urban fabric
230-000-077
1
112
Artificial surfaces
Urban fabric
Discontinuous urban fabric
255-000-000
2
121
Artificial surfaces
Industrial, commercial and transport units
Industrial or commercial units
204-077-242
3
122
Artificial surfaces
Industrial, commercial and transport units
Road and rail networks and associated land
204-000-000
4
123
Artificial surfaces
Industrial, commercial and transport units
Port areas
230-204-204
This table contains a field CLC_CODE which we use to connect the correct mapping to our landuse cover GeoDataframe, which has a field clc_int. We will now merge the lookup table codes (a Pandas dataframe) into our lulc GeoDataframe, based on the identifiers in the mentioned fields:
# general pandas table joinlulc = lulc.merge(codes, left_on='clc_int', right_on='CLC_CODE')display(lulc.sample(5))
code_12
id
remark
area_ha
shape_area
clc_int
Landuse
geometry
CLC_CODE
LABEL1
LABEL2
LABEL3
RGB
160
312
EU-2041277
None
32.806670
3.280667e+05
312
FRSE
POLYGON ((660788.430 6461693.995, 660742.430 6...
312
Forest and semi natural areas
Forests
Coniferous forest
000-166-000
243
512
EU-2057340
None
99.728388
9.972839e+05
512
WATR
POLYGON ((651634.186 6453417.991, 651530.499 6...
512
Water bodies
Inland waters
Water bodies
128-242-230
202
313
EU-2046379
None
99.459243
9.945924e+05
313
FRST
POLYGON ((662385.116 6458010.496, 662342.248 6...
313
Forest and semi natural areas
Forests
Mixed forest
077-255-000
163
312
EU-2041300
None
194.838575
1.948386e+06
312
FRSE
POLYGON ((657890.867 6463382.496, 657940.056 6...
312
Forest and semi natural areas
Forests
Coniferous forest
000-166-000
220
324
EU-2052064
None
29.938636
2.993864e+05
324
RNGB
POLYGON ((657621.023 6456368.367, 657599.412 6...
324
Forest and semi natural areas
Scrub and/or herbaceous vegetation associations
Transitional woodland-shrub
166-242-000
# See the column names and confirm that we now have a column called# LABEL2, that gives us some textual description for the landuse codesdisplay(lulc.columns)
Let’s also get rid of the unnecessary columns by selecting only columns that we need i.e. Landuse, LABEL2 and geometry
# Columns that will be selectedselected_cols = ['Landuse', 'LABEL2','geometry']# Select those columns, we intentional assign to and overwrite the same variablelulc = lulc[selected_cols]# Let's see 5 randomly sampled rowsdisplay(lulc.sample(5))
Landuse
LABEL2
geometry
127
AGRL
Heterogeneous agricultural areas
POLYGON ((661595.757 6457878.711, 661567.866 6...
53
AGRL
Arable land
MULTIPOLYGON (((665684.835 6458288.958, 665670...
187
FRST
Forests
POLYGON ((660393.970 6455560.175, 660412.560 6...
97
AGRL
Heterogeneous agricultural areas
MULTIPOLYGON (((658432.374 6469848.151, 658267...
27
AGRL
Arable land
POLYGON ((661923.007 6455403.736, 661884.498 6...
Now we have cleaned the data and have only those columns that we need for our analysis.
Join the layers
Now we are ready to perform the spatial join between the two layers that we have. The aim here is to get information about how many species sightings (of which species) happened in which landuse types? . Thus, we want to join attributes from the landuse layer we just modified into the already used and prepared monitoring GeoPackage file:
Read the category_3_species_porijogi.gpkg layer into memory
# protected species under class 3 monitoring sightingsspecies_fp ="../files/data/L3/category_3_species_porijogi.gpkg"# Read data species = gpd.read_file(species_fp, layer='category_3_species_porijogi', driver='GPKG')
Let’s make sure that the coordinate reference system of the layers are identical:
# Check the crs of landuseprint(lulc.crs)# Check the crs of species layer in case we need to reproject the# geometries to make them comparableprint(species.crs)# Do they match? - We can test thatprint(f"Do they match: {lulc.crs == species.crs} ")
EPSG:3301
EPSG:3301
Do they match: True
They are identical. Thus, we can be sure that when doing spatial queries between layers the locations match and we get the right results e.g. from the spatial join that we are conducting here.
Let’s now join the attributes from lulc (2) GeoDataFrame into species GeoDataFrame (1) by using gpd.sjoin() -function
# Make a spatial joinjoined = gpd.sjoin(species, lulc, how="inner", predicate="within")# Let's check the resultdisplay(joined.head(5))
OBJECTID
LIIK
NIMI
EXT_SYST_I
KKR_KOOD
PRIV_TYYP
STAATUS
IMPORT
LAADIMISKP
geometry
index_right
Landuse
LABEL2
126
144115
taimed III
läikiv kurdsirbik
-1324418359
KLO9400024
Avalik
kontrollitud
0
2018-10-29
POINT (652641.854 6458099.088)
224
RNGB
Scrub and/or herbaceous vegetation associations
132
144202
taimed III
siberi võhumõõk
1344032932
KLO9312850
Avalik
kontrollitud
0
2018-10-29
POINT (660367.911 6469599.134)
143
FRSD
Forests
157
151696
taimed III
balti sõrmkäpp
-376320786
KLO9317690
Avalik
kontrollitud
0
2018-10-29
POINT (656609.812 6459582.666)
201
FRST
Forests
201
154300
taimed III
harilik käoraamat
-403691661
KLO9303449
Avalik
kontrollitud
0
2018-10-29
POINT (657356.897 6459780.112)
201
FRST
Forests
257
154692
taimed III
suur käopõll
1784194494
KLO9303464
Avalik
kontrollitud
0
2018-10-29
POINT (657291.854 6459783.092)
201
FRST
Forests
Well done. Now we have performed a successful spatial join where we got two new columns into our join GeoDataFrame, i.e. index_right that tells the index of the matching polygon in the lulc layer and species.
We could save this layer into a file and review it in QGIS:
# Output pathoutfp ="../files/data/L3/landuse_per_species.gpkg"# Save to diskjoined.to_file(outfp, driver="GPKG", layer='landuse_per_species')
Do the results make sense? Let’s evaluate this a bit by grouping and querying the resulting join for largest landuse type and species types combinations:
# how many times each species is represented in the datadisplay(joined['NIMI'].value_counts())
NIMI
balti sõrmkäpp 20
valge-toonekurg 20
kahkjaspunane sõrmkäpp 17
suur käopõll 16
aas-karukell 6
soo-neiuvaip 5
vööthuul-sõrmkäpp 5
sulgjas õhik 2
harilik käoraamat 2
siberi võhumõõk 2
läikiv kurdsirbik 1
valgelaup-rabakiil 1
Veski 1
kodukakk 1
hink 1
vingerjas 1
kuradi-sõrmkäpp 1
ohakasoomukas 1
suur-kuldtiib 1
roomav öövilge 1
hall käpp 1
suur-rabakiil 1
Name: count, dtype: int64
# how many times each landuse type is represented in the datadisplay(joined['LABEL2'].value_counts())
Those values by themselves only give a cursorily view of the data. Let’s make some meaningful aggregations.
# initialise empty listdata_list = []for species_id, species_group in joined.groupby('NIMI'): lulc_count = species_group['LABEL2'].value_counts() top = lulc_count.head(1)# add info to list data_row_as_dict = {'species_id':species_id,'all_sights': len(species_group),'top_lulc': top.index[0],'sights_in_top': top[0] } data_list.append(data_row_as_dict)# Creates DataFrame from now filled list with data items.top_sights = pd.DataFrame(data_list)# Print the datatop_sights.sort_values(by=['all_sights','sights_in_top'], ascending=False).head(5)
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
'sights_in_top': top[0] }
species_id
all_sights
top_lulc
sights_in_top
18
valge-toonekurg
20
Heterogeneous agricultural areas
15
2
balti sõrmkäpp
20
Forests
12
6
kahkjaspunane sõrmkäpp
17
Forests
12
15
suur käopõll
16
Forests
14
1
aas-karukell
6
Forests
4
The “valge-toonekurk” (white stork) is the most sighted bird species in the data, and 15 out of 20 sights are in the “Heterogeneous agricultural areas” landuse type.