Spatial join

A Spatial join is yet another classic GIS problem. Getting attributes from one layer and transferring them into another layer based on their spatial relationship is something you most likely need to do on a regular basis.

The previous materials focused on learning how to perform a Point in Polygon query. We could now apply those techniques and create our own function to perform a spatial join between two layers based on their spatial relationship. We could for example join the attributes of a polygon layer into a point layer where each point would get the attributes of a polygon that contains the point.

Luckily, spatial joins (gpd.sjoin() -function) is already implemented in Geopandas, thus we do not need to create it ourselves. There are three possible types of join that can be applied in spatial join that are determined with op -parameter:

Sounds familiar? Yep, all of those spatial relationships were discussed in the previous materials, and even Shapely basics thus you should know how they work.

Let’s perform a spatial join between the species monitoring data GeoPackage file, category_3_species_porijogi.gpkg and a Polygon layer that is extracted Corine Landuse Cover for the Porijogi region.

Download and clean the data

For this lesson we will be using publicly available Corine Landuse Cover for the Porijogi region in Tartumaa:

porijogi_corine_landuse.dbf  porijogi_corine_landuse.shp
porijogi_corine_landuse.prj  porijogi_corine_landuse.shx

You should now have the files listed above in your Data folder.

Let’s read the data into geopandas and see what we have.

import geopandas as gpd

# Filepath
fp = "../files/data/L3/porijogi_corine_landuse.shp"

# Read the data
lulc = gpd.read_file(fp)

# See the first rows
display(lulc.head(5))
code_12 id remark area_ha shape_area clc_int Landuse geometry
0 112 EU-2024407 None 67.055321 670553.20630 112 URML POLYGON ((658854.791 6458244.203, 658826.936 6...
1 112 EU-2024418 None 36.452500 364525.00295 112 URML POLYGON ((663553.865 6459840.806, 663570.622 6...
2 112 EU-2024426 None 33.525145 335251.45070 112 URML POLYGON ((659006.349 6463680.667, 659031.241 6...
3 112 EU-2024431 None 30.111925 301119.25420 112 URML POLYGON ((658401.027 6466951.556, 658518.679 6...
4 112 EU-2024434 None 70.979465 709794.64765 112 URML POLYGON ((658348.310 6467489.990, 658040.310 6...

We have multiple columns in the dataset but the most important one here is the column clc_int (corine landuse code) that tells the type of landuse cover under that polygon. Download the codes lookup table:

In order to know what the codes mean, we will merge a lookup table to our spatial landuse cover GeoDataframe.

import pandas as pd

codes = pd.read_csv('../files/data/L3/corine_landuse_codes.csv', sep=';')
display(codes.head(5))
CLC_CODE LABEL1 LABEL2 LABEL3 RGB
0 111 Artificial surfaces Urban fabric Continuous urban fabric 230-000-077
1 112 Artificial surfaces Urban fabric Discontinuous urban fabric 255-000-000
2 121 Artificial surfaces Industrial, commercial and transport units Industrial or commercial units 204-077-242
3 122 Artificial surfaces Industrial, commercial and transport units Road and rail networks and associated land 204-000-000
4 123 Artificial surfaces Industrial, commercial and transport units Port areas 230-204-204

This table contains a field CLC_CODE which we use to connect the correct mapping to our landuse cover GeoDataframe, which has a field clc_int. We will now merge the lookup table codes (a Pandas dataframe) into our lulc GeoDataframe, based on the identifiers in the mentioned fields:

# general pandas table join
lulc = lulc.merge(codes, left_on='clc_int', right_on='CLC_CODE')
display(lulc.sample(5))
code_12 id remark area_ha shape_area clc_int Landuse geometry CLC_CODE LABEL1 LABEL2 LABEL3 RGB
160 312 EU-2041277 None 32.806670 3.280667e+05 312 FRSE POLYGON ((660788.430 6461693.995, 660742.430 6... 312 Forest and semi natural areas Forests Coniferous forest 000-166-000
243 512 EU-2057340 None 99.728388 9.972839e+05 512 WATR POLYGON ((651634.186 6453417.991, 651530.499 6... 512 Water bodies Inland waters Water bodies 128-242-230
202 313 EU-2046379 None 99.459243 9.945924e+05 313 FRST POLYGON ((662385.116 6458010.496, 662342.248 6... 313 Forest and semi natural areas Forests Mixed forest 077-255-000
163 312 EU-2041300 None 194.838575 1.948386e+06 312 FRSE POLYGON ((657890.867 6463382.496, 657940.056 6... 312 Forest and semi natural areas Forests Coniferous forest 000-166-000
220 324 EU-2052064 None 29.938636 2.993864e+05 324 RNGB POLYGON ((657621.023 6456368.367, 657599.412 6... 324 Forest and semi natural areas Scrub and/or herbaceous vegetation associations Transitional woodland-shrub 166-242-000
# See the column names and confirm that we now have a column called
# LABEL2, that gives us some textual description for the landuse codes
display(lulc.columns)
Index(['code_12', 'id', 'remark', 'area_ha', 'shape_area', 'clc_int',
       'Landuse', 'geometry', 'CLC_CODE', 'LABEL1', 'LABEL2', 'LABEL3', 'RGB'],
      dtype='object')

Let’s also get rid of the unnecessary columns by selecting only columns that we need i.e. Landuse, LABEL2 and geometry

# Columns that will be selected
selected_cols = ['Landuse', 'LABEL2','geometry']

# Select those columns, we intentional assign to and overwrite the same variable
lulc = lulc[selected_cols]

# Let's see 5 randomly sampled rows
display(lulc.sample(5))
Landuse LABEL2 geometry
127 AGRL Heterogeneous agricultural areas POLYGON ((661595.757 6457878.711, 661567.866 6...
53 AGRL Arable land MULTIPOLYGON (((665684.835 6458288.958, 665670...
187 FRST Forests POLYGON ((660393.970 6455560.175, 660412.560 6...
97 AGRL Heterogeneous agricultural areas MULTIPOLYGON (((658432.374 6469848.151, 658267...
27 AGRL Arable land POLYGON ((661923.007 6455403.736, 661884.498 6...

Now we have cleaned the data and have only those columns that we need for our analysis.

Join the layers

Now we are ready to perform the spatial join between the two layers that we have. The aim here is to get information about how many species sightings (of which species) happened in which landuse types? . Thus, we want to join attributes from the landuse layer we just modified into the already used and prepared monitoring GeoPackage file:

Read the category_3_species_porijogi.gpkg layer into memory

# protected species under class 3 monitoring sightings
species_fp = "../files/data/L3/category_3_species_porijogi.gpkg"

# Read data 
species = gpd.read_file(species_fp, layer='category_3_species_porijogi', driver='GPKG')

Let’s make sure that the coordinate reference system of the layers are identical:

# Check the crs of landuse
print(lulc.crs)

# Check the crs of species layer in case we need to reproject the
# geometries to make them comparable
print(species.crs)

# Do they match? - We can test that
print(f"Do they match: {lulc.crs == species.crs} ")
EPSG:3301
EPSG:3301
Do they match: True 

They are identical. Thus, we can be sure that when doing spatial queries between layers the locations match and we get the right results e.g. from the spatial join that we are conducting here.

Let’s now join the attributes from lulc (2) GeoDataFrame into species GeoDataFrame (1) by using gpd.sjoin() -function

# Make a spatial join
joined = gpd.sjoin(species, lulc, how="inner", predicate="within")

# Let's check the result
display(joined.head(5))
OBJECTID LIIK NIMI EXT_SYST_I KKR_KOOD PRIV_TYYP STAATUS IMPORT LAADIMISKP geometry index_right Landuse LABEL2
126 144115 taimed III läikiv kurdsirbik -1324418359 KLO9400024 Avalik kontrollitud 0 2018-10-29 POINT (652641.854 6458099.088) 224 RNGB Scrub and/or herbaceous vegetation associations
132 144202 taimed III siberi võhumõõk 1344032932 KLO9312850 Avalik kontrollitud 0 2018-10-29 POINT (660367.911 6469599.134) 143 FRSD Forests
157 151696 taimed III balti sõrmkäpp -376320786 KLO9317690 Avalik kontrollitud 0 2018-10-29 POINT (656609.812 6459582.666) 201 FRST Forests
201 154300 taimed III harilik käoraamat -403691661 KLO9303449 Avalik kontrollitud 0 2018-10-29 POINT (657356.897 6459780.112) 201 FRST Forests
257 154692 taimed III suur käopõll 1784194494 KLO9303464 Avalik kontrollitud 0 2018-10-29 POINT (657291.854 6459783.092) 201 FRST Forests

Well done. Now we have performed a successful spatial join where we got two new columns into our join GeoDataFrame, i.e. index_right that tells the index of the matching polygon in the lulc layer and species.

We could save this layer into a file and review it in QGIS:

# Output path
outfp = "../files/data/L3/landuse_per_species.gpkg"

# Save to disk
joined.to_file(outfp, driver="GPKG", layer='landuse_per_species')

Do the results make sense? Let’s evaluate this a bit by grouping and querying the resulting join for largest landuse type and species types combinations:

# how many times each species is represented in the data
display(joined['NIMI'].value_counts())
NIMI
balti sõrmkäpp            20
valge-toonekurg           20
kahkjaspunane sõrmkäpp    17
suur käopõll              16
aas-karukell               6
soo-neiuvaip               5
vööthuul-sõrmkäpp          5
sulgjas õhik               2
harilik käoraamat          2
siberi võhumõõk            2
läikiv kurdsirbik          1
valgelaup-rabakiil         1
Veski                      1
kodukakk                   1
hink                       1
vingerjas                  1
kuradi-sõrmkäpp            1
ohakasoomukas              1
suur-kuldtiib              1
roomav öövilge             1
hall käpp                  1
suur-rabakiil              1
Name: count, dtype: int64
# how many times each landuse type is represented in the data
display(joined['LABEL2'].value_counts())
LABEL2
Forests                                            52
Scrub and/or herbaceous vegetation associations    22
Heterogeneous agricultural areas                   22
Pastures                                            4
Urban fabric                                        2
Inland waters                                       2
Arable land                                         2
Inland wetlands                                     1
Name: count, dtype: int64

Those values by themselves only give a cursorily view of the data. Let’s make some meaningful aggregations.

# initialise empty list
data_list = []

for species_id, species_group in joined.groupby('NIMI'):
    lulc_count = species_group['LABEL2'].value_counts()
    top = lulc_count.head(1)
    # add info to list
    data_row_as_dict = {'species_id':species_id,
                        'all_sights': len(species_group),
                        'top_lulc': top.index[0],
                        'sights_in_top': top[0] }
    data_list.append(data_row_as_dict)

# Creates DataFrame from now filled list with data items.
top_sights = pd.DataFrame(data_list)

# Print the data
top_sights.sort_values(by=['all_sights','sights_in_top'], ascending=False).head(5)
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
/tmp/ipykernel_16545/4112557494.py:11: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]`
  'sights_in_top': top[0] }
species_id all_sights top_lulc sights_in_top
18 valge-toonekurg 20 Heterogeneous agricultural areas 15
2 balti sõrmkäpp 20 Forests 12
6 kahkjaspunane sõrmkäpp 17 Forests 12
15 suur käopõll 16 Forests 14
1 aas-karukell 6 Forests 4

The “valge-toonekurk” (white stork) is the most sighted bird species in the data, and 15 out of 20 sights are in the “Heterogeneous agricultural areas” landuse type.

Download the notebook:

file:spatial-join.ipynb

Launch in the web/MyBinder:

image