Spatial join

A Spatial join is yet another classic GIS problem. Getting attributes from one layer and transferring them into another layer based on their spatial relationship is something you most likely need to do on a regular basis.

The previous materials focused on learning how to perform a Point in Polygon query. We could now apply those techniques and create our own function to perform a spatial join between two layers based on their spatial relationship. We could for example join the attributes of a polygon layer into a point layer where each point would get the attributes of a polygon that contains the point.

Luckily, spatial joins (gpd.sjoin() -function) is already implemented in Geopandas, thus we do not need to create it ourselves. There are three possible types of join that can be applied in spatial join that are determined with op -parameter:

"intersects"
"within"
"contains"

Sounds familiar? Yep, all of those spatial relationships were discussed in the previous materials, and even Shapely basics thus you should know how they work.

Let’s perform a spatial join between the species monitoring data GeoPackage file, category_3_species_porijogi.gpkg and a Polygon layer that is extracted Corine Landuse Cover for the Porijogi region.

Download and clean the data

For this lesson we will be using publicly available Corine Landuse Cover for the Porijogi region in Tartumaa:

Download file:porijogi_corine_landuse.zip
Unzip the file into into your L3 folder:

porijogi_corine_landuse.dbf  porijogi_corine_landuse.shp
porijogi_corine_landuse.prj  porijogi_corine_landuse.shx

You should now have the files listed above in your Data folder.

Let’s read the data into geopandas and see what we have.

import geopandas as gpd

# Filepath
fp = "../files/data/L3/porijogi_corine_landuse.shp"

# Read the data
lulc = gpd.read_file(fp)

# See the first rows
display(lulc.head(5))

	code_12	id	remark	area_ha	shape_area	clc_int	Landuse	geometry
0	112	EU-2024407	None	67.055321	670553.20630	112	URML	POLYGON ((658854.791 6458244.203, 658826.936 6...
1	112	EU-2024418	None	36.452500	364525.00295	112	URML	POLYGON ((663553.865 6459840.806, 663570.622 6...
2	112	EU-2024426	None	33.525145	335251.45070	112	URML	POLYGON ((659006.349 6463680.667, 659031.241 6...
3	112	EU-2024431	None	30.111925	301119.25420	112	URML	POLYGON ((658401.027 6466951.556, 658518.679 6...
4	112	EU-2024434	None	70.979465	709794.64765	112	URML	POLYGON ((658348.310 6467489.990, 658040.310 6...

We have multiple columns in the dataset but the most important one here is the column clc_int (corine landuse code) that tells the type of landuse cover under that polygon. Download the codes lookup table:

file:corine_landuse_codes.csv

In order to know what the codes mean, we will merge a lookup table to our spatial landuse cover GeoDataframe.

import pandas as pd

codes = pd.read_csv('../files/data/L3/corine_landuse_codes.csv', sep=';')
display(codes.head(5))

	CLC_CODE	LABEL1	LABEL2	LABEL3	RGB
0	111	Artificial surfaces	Urban fabric	Continuous urban fabric	230-000-077
1	112	Artificial surfaces	Urban fabric	Discontinuous urban fabric	255-000-000
2	121	Artificial surfaces	Industrial, commercial and transport units	Industrial or commercial units	204-077-242
3	122	Artificial surfaces	Industrial, commercial and transport units	Road and rail networks and associated land	204-000-000
4	123	Artificial surfaces	Industrial, commercial and transport units	Port areas	230-204-204

This table contains a field CLC_CODE which we use to connect the correct mapping to our landuse cover GeoDataframe, which has a field clc_int. We will now merge the lookup table codes (a Pandas dataframe) into our lulc GeoDataframe, based on the identifiers in the mentioned fields:

# general pandas table join
lulc = lulc.merge(codes, left_on='clc_int', right_on='CLC_CODE')
display(lulc.sample(5))

	code_12	id	remark	area_ha	shape_area	clc_int	Landuse	geometry	CLC_CODE	LABEL1	LABEL2	LABEL3	RGB
160	312	EU-2041277	None	32.806670	3.280667e+05	312	FRSE	POLYGON ((660788.430 6461693.995, 660742.430 6...	312	Forest and semi natural areas	Forests	Coniferous forest	000-166-000
243	512	EU-2057340	None	99.728388	9.972839e+05	512	WATR	POLYGON ((651634.186 6453417.991, 651530.499 6...	512	Water bodies	Inland waters	Water bodies	128-242-230
202	313	EU-2046379	None	99.459243	9.945924e+05	313	FRST	POLYGON ((662385.116 6458010.496, 662342.248 6...	313	Forest and semi natural areas	Forests	Mixed forest	077-255-000
163	312	EU-2041300	None	194.838575	1.948386e+06	312	FRSE	POLYGON ((657890.867 6463382.496, 657940.056 6...	312	Forest and semi natural areas	Forests	Coniferous forest	000-166-000
220	324	EU-2052064	None	29.938636	2.993864e+05	324	RNGB	POLYGON ((657621.023 6456368.367, 657599.412 6...	324	Forest and semi natural areas	Scrub and/or herbaceous vegetation associations	Transitional woodland-shrub	166-242-000

# See the column names and confirm that we now have a column called
# LABEL2, that gives us some textual description for the landuse codes
display(lulc.columns)

Index(['code_12', 'id', 'remark', 'area_ha', 'shape_area', 'clc_int',
       'Landuse', 'geometry', 'CLC_CODE', 'LABEL1', 'LABEL2', 'LABEL3', 'RGB'],
      dtype='object')

Let’s also get rid of the unnecessary columns by selecting only columns that we need i.e. Landuse, LABEL2 and geometry

# Columns that will be selected
selected_cols = ['Landuse', 'LABEL2','geometry']

# Select those columns, we intentional assign to and overwrite the same variable
lulc = lulc[selected_cols]

# Let's see 5 randomly sampled rows
display(lulc.sample(5))

	Landuse	LABEL2	geometry
127	AGRL	Heterogeneous agricultural areas	POLYGON ((661595.757 6457878.711, 661567.866 6...
53	AGRL	Arable land	MULTIPOLYGON (((665684.835 6458288.958, 665670...
187	FRST	Forests	POLYGON ((660393.970 6455560.175, 660412.560 6...
97	AGRL	Heterogeneous agricultural areas	MULTIPOLYGON (((658432.374 6469848.151, 658267...
27	AGRL	Arable land	POLYGON ((661923.007 6455403.736, 661884.498 6...

Now we have cleaned the data and have only those columns that we need for our analysis.

Join the layers

Now we are ready to perform the spatial join between the two layers that we have. The aim here is to get information about how many species sightings (of which species) happened in which landuse types? . Thus, we want to join attributes from the landuse layer we just modified into the already used and prepared monitoring GeoPackage file:

category_3_species_porijogi.gpkg

Read the category_3_species_porijogi.gpkg layer into memory

# protected species under class 3 monitoring sightings
species_fp = "../files/data/L3/category_3_species_porijogi.gpkg"

# Read data 
species = gpd.read_file(species_fp, layer='category_3_species_porijogi', driver='GPKG')

Let’s make sure that the coordinate reference system of the layers are identical:

# Check the crs of landuse
print(lulc.crs)

# Check the crs of species layer in case we need to reproject the
# geometries to make them comparable
print(species.crs)

# Do they match? - We can test that
print(f"Do they match: {lulc.crs == species.crs} ")

EPSG:3301
EPSG:3301
Do they match: True

They are identical. Thus, we can be sure that when doing spatial queries between layers the locations match and we get the right results e.g. from the spatial join that we are conducting here.

Let’s now join the attributes from lulc (2) GeoDataFrame into species GeoDataFrame (1) by using gpd.sjoin() -function

# Make a spatial join
joined = gpd.sjoin(species, lulc, how="inner", predicate="within")

# Let's check the result
display(joined.head(5))

	OBJECTID	LIIK	NIMI	EXT_SYST_I	KKR_KOOD	PRIV_TYYP	STAATUS	LAADIMISKP	geometry	index_right	Landuse	LABEL2
126	144115	taimed III	läikiv kurdsirbik	-1324418359	KLO9400024	Avalik	kontrollitud	2018-10-29	POINT (652641.854 6458099.088)	224	RNGB	Scrub and/or herbaceous vegetation associations
132	144202	taimed III	siberi võhumõõk	1344032932	KLO9312850	Avalik	kontrollitud	2018-10-29	POINT (660367.911 6469599.134)	143	FRSD	Forests
157	151696	taimed III	balti sõrmkäpp	-376320786	KLO9317690	Avalik	kontrollitud	2018-10-29	POINT (656609.812 6459582.666)	201	FRST	Forests
201	154300	taimed III	harilik käoraamat	-403691661	KLO9303449	Avalik	kontrollitud	2018-10-29	POINT (657356.897 6459780.112)	201	FRST	Forests
257	154692	taimed III	suur käopõll	1784194494	KLO9303464	Avalik	kontrollitud	2018-10-29	POINT (657291.854 6459783.092)	201	FRST	Forests

Well done. Now we have performed a successful spatial join where we got two new columns into our join GeoDataFrame, i.e. index_right that tells the index of the matching polygon in the lulc layer and species.

We could save this layer into a file and review it in QGIS:

# Output path
outfp = "../files/data/L3/landuse_per_species.gpkg"

# Save to disk
joined.to_file(outfp, driver="GPKG", layer='landuse_per_species')

Do the results make sense? Let’s evaluate this a bit by grouping and querying the resulting join for largest landuse type and species types combinations:

# how many times each species is represented in the data
display(joined['NIMI'].value_counts())

NIMI
balti sõrmkäpp            20
valge-toonekurg           20
kahkjaspunane sõrmkäpp    17
suur käopõll              16
aas-karukell               6
soo-neiuvaip               5
vööthuul-sõrmkäpp          5
sulgjas õhik               2
harilik käoraamat          2
siberi võhumõõk            2
läikiv kurdsirbik          1
valgelaup-rabakiil         1
Veski                      1
kodukakk                   1
hink                       1
vingerjas                  1
kuradi-sõrmkäpp            1
ohakasoomukas              1
suur-kuldtiib              1
roomav öövilge             1
hall käpp                  1
suur-rabakiil              1
Name: count, dtype: int64

# how many times each landuse type is represented in the data
display(joined['LABEL2'].value_counts())

LABEL2
Forests                                            52
Scrub and/or herbaceous vegetation associations    22
Heterogeneous agricultural areas                   22
Pastures                                            4
Urban fabric                                        2
Inland waters                                       2
Arable land                                         2
Inland wetlands                                     1
Name: count, dtype: int64

Those values by themselves only give a cursorily view of the data. Let’s make some meaningful aggregations.

# initialise empty list
data_list = []

for species_id, species_group in joined.groupby('NIMI'):
    lulc_count = species_group['LABEL2'].value_counts()
    top = lulc_count.head(1)
    # add info to list
    data_row_as_dict = {'species_id':species_id,
                        'all_sights': len(species_group),
                        'top_lulc': top.index[0],
                        'sights_in_top': top[0] }
    data_list.append(data_row_as_dict)

# Creates DataFrame from now filled list with data items.
top_sights = pd.DataFrame(data_list)

# Print the data
top_sights.sort_values(by=['all_sights','sights_in_top'], ascending=False).head(5)

/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }

	species_id	all_sights	top_lulc	sights_in_top
18	valge-toonekurg	20	Heterogeneous agricultural areas	15
2	balti sõrmkäpp	20	Forests	12
6	kahkjaspunane sõrmkäpp	17	Forests	12
15	suur käopõll	16	Forests	14
1	aas-karukell	6	Forests	4

The “valge-toonekurk” (white stork) is the most sighted bird species in the data, and 15 out of 20 sights are in the “Heterogeneous agricultural areas” landuse type.

Download the notebook:

file:spatial-join.ipynb

Launch in the web/MyBinder: