Spatial join
A Spatial join is yet another classic GIS problem. Getting attributes from one layer and transferring them into another layer based on their spatial relationship is something you most likely need to do on a regular basis.
The previous materials focused on learning how to perform a Point in Polygon query.
We could now apply those techniques and create our
own function to perform a spatial join between two layers based on their
spatial relationship. We could for example join the attributes of a
polygon layer into a point layer where each point would get the
attributes of a polygon that contains
the point.
Luckily, spatial joins
(gpd.sjoin()
-function) is already implemented in Geopandas, thus we
do not need to create it ourselves. There are three possible types of
join that can be applied in spatial join that are determined with op
-parameter:
"intersects"
"within"
"contains"
Sounds familiar? Yep, all of those spatial relationships were discussed in the previous materials, thus you should know how they work.
Let’s perform a spatial join between the species monitoring data GeoPackage file, category_3_species_porijogi.gpkg and a Polygon layer that is extracted Corine Landuse Cover for the Porijogi region.
Download and clean the data
For this lesson we will be using publicly available Corine Landuse Cover for the Porijogi region in Tartumaa.
Unzip the file into into your L3 folder
porijogi_corine_landuse.dbf porijogi_corine_landuse.shp
porijogi_corine_landuse.prj porijogi_corine_landuse.shx
You should now have the files listed above in your Data folder.
Let’s read the data into geopandas and see what we have.
In [1]: import geopandas as gpd
# Filepath
In [2]: fp = "source/_static/data/L3/porijogi_corine_landuse.shp"
# Read the data
In [3]: lulc = gpd.read_file(fp)
# See the first rows
In [4]: lulc.head()
Out[4]:
code_12 id remark area_ha shape_area clc_int Landuse geometry
0 112 EU-2024407 None 67.055321 670553.20630 112 URML POLYGON ((658854.791 6458244.203, 658826.936 6...
1 112 EU-2024418 None 36.452500 364525.00295 112 URML POLYGON ((663553.865 6459840.806, 663570.622 6...
2 112 EU-2024426 None 33.525145 335251.45070 112 URML POLYGON ((659006.349 6463680.667, 659031.241 6...
3 112 EU-2024431 None 30.111925 301119.25420 112 URML POLYGON ((658401.027 6466951.556, 658518.679 6...
4 112 EU-2024434 None 70.979465 709794.64765 112 URML POLYGON ((658348.310 6467489.990, 658040.310 6...
Okey so we have multiple columns in the dataset but the most important
one here is the column clc_int
(corine landuse code) that
tells the type of landuse cover under that polygon. Download the codes lookup table
In order to know what the codes mean, we will merge a lookup table to our spatial landuse cover GeoDataframe.
In [5]: import pandas as pd
In [6]: codes = pd.read_csv('source/_static/data/L3/corine_landuse_codes.csv', sep=';')
In [7]: codes
Out[7]:
CLC_CODE LABEL1 LABEL2 LABEL3 RGB
0 111 Artificial surfaces Urban fabric Continuous urban fabric 230-000-077
1 112 Artificial surfaces Urban fabric Discontinuous urban fabric 255-000-000
2 121 Artificial surfaces Industrial, commercial and transport units Industrial or commercial units 204-077-242
3 122 Artificial surfaces Industrial, commercial and transport units Road and rail networks and associated land 204-000-000
4 123 Artificial surfaces Industrial, commercial and transport units Port areas 230-204-204
5 124 Artificial surfaces Industrial, commercial and transport units Airports 230-204-230
6 131 Artificial surfaces Mine, dump and construction sites Mineral extraction sites 166-000-204
7 132 Artificial surfaces Mine, dump and construction sites Dump sites 166-077-000
8 133 Artificial surfaces Mine, dump and construction sites Construction sites 255-077-255
9 141 Artificial surfaces Artificial, non-agricultural vegetated areas Green urban areas 255-166-255
10 142 Artificial surfaces Artificial, non-agricultural vegetated areas Sport and leisure facilities 255-230-255
11 211 Agricultural areas Arable land Non-irrigated arable land 255-255-168
12 222 Agricultural areas Permanent crops Fruit trees and berry plantations 242-166-077
13 231 Agricultural areas Pastures Pastures 230-230-077
14 242 Agricultural areas Heterogeneous agricultural areas Complex cultivation patterns 255-230-077
15 243 Agricultural areas Heterogeneous agricultural areas Land principally occupied by agriculture, with... 230-204-077
16 311 Forest and semi natural areas Forests Broad-leaved forest 128-255-000
17 312 Forest and semi natural areas Forests Coniferous forest 000-166-000
18 313 Forest and semi natural areas Forests Mixed forest 077-255-000
19 321 Forest and semi natural areas Scrub and/or herbaceous vegetation associations Natural grasslands 204-242-077
20 322 Forest and semi natural areas Scrub and/or herbaceous vegetation associations Moors and heathland 166-255-128
21 324 Forest and semi natural areas Scrub and/or herbaceous vegetation associations Transitional woodland-shrub 166-242-000
22 331 Forest and semi natural areas Open spaces with little or no vegetation Beaches, dunes, sands 230-230-230
23 333 Forest and semi natural areas Open spaces with little or no vegetation Sparsely vegetated areas 204-255-204
24 411 Wetlands Inland wetlands Inland marshes 166-166-255
25 412 Wetlands Inland wetlands Peat bogs 077-077-255
26 421 Wetlands Maritime wetlands Salt marshes 204-204-255
27 511 Water bodies Inland waters Water courses 000-204-242
28 512 Water bodies Inland waters Water bodies 128-242-230
29 521 Water bodies Marine waters Coastal lagoons 000-255-166
30 523 Water bodies Marine waters Sea and ocean 230-242-255
This table contains a field CLC_CODE
which we use to connect the correct mapping to our landuse cover GeoDataframe, which has a field clc_int
.
We will now merge the lookup table codes
(a Pandas dataframe) into our lulc
GeoDataframe, based on the identifiers in the mentioned fields:
In [8]: lulc = lulc.merge(codes, left_on='clc_int', right_on='CLC_CODE')
In [9]: lulc.sample(10)
Out[9]:
code_12 id remark area_ha ... LABEL1 LABEL2 LABEL3 RGB
136 243 EU-2034427 None 148.476483 ... Agricultural areas Heterogeneous agricultural areas Land principally occupied by agriculture, with... 230-204-077
239 324 EU-2052422 None 90.313232 ... Forest and semi natural areas Scrub and/or herbaceous vegetation associations Transitional woodland-shrub 166-242-000
167 313 EU-2046121 None 27.677425 ... Forest and semi natural areas Forests Mixed forest 077-255-000
70 231 EU-2030036 None 46.906026 ... Agricultural areas Pastures Pastures 230-230-077
63 231 EU-2029875 None 81.661622 ... Agricultural areas Pastures Pastures 230-230-077
55 211 EU-2026956 None 310.846714 ... Agricultural areas Arable land Non-irrigated arable land 255-255-168
80 242 EU-2032462 None 83.893750 ... Agricultural areas Heterogeneous agricultural areas Complex cultivation patterns 255-230-077
109 243 EU-2034192 None 29.355642 ... Agricultural areas Heterogeneous agricultural areas Land principally occupied by agriculture, with... 230-204-077
144 312 EU-2041007 None 54.119761 ... Forest and semi natural areas Forests Coniferous forest 000-166-000
81 242 EU-2032473 None 54.556966 ... Agricultural areas Heterogeneous agricultural areas Complex cultivation patterns 255-230-077
[10 rows x 13 columns]
# See the column names and confirm that we now have a column called LABEL2, that gives us some textual description for the landuse codes
In [10]: lulc.columns
Out[10]:
Index(['code_12', 'id', 'remark', 'area_ha', 'shape_area', 'clc_int',
'Landuse', 'geometry', 'CLC_CODE', 'LABEL1', 'LABEL2', 'LABEL3', 'RGB'],
dtype='object')
Let’s also get rid of all unnecessary columns by selecting only
columns that we need i.e. Landuse
, LABEL2
and geometry
# Columns that will be sected
In [11]: selected_cols = ['Landuse', 'LABEL2','geometry']
# Select those columns
In [12]: lulc = lulc[selected_cols]
# Let's see 10 randomly sampled rows
In [13]: lulc.sample(10)
Out[13]:
Landuse LABEL2 geometry
63 PAST Pastures POLYGON ((663852.277 6457100.927, 663820.431 6...
15 AGRL Arable land POLYGON ((667346.056 6448925.436, 667282.868 6...
142 FRSD Forests POLYGON ((659960.470 6466484.630, 659990.373 6...
114 AGRL Heterogeneous agricultural areas MULTIPOLYGON (((650961.058 6453763.991, 650960...
174 FRST Forests POLYGON ((665445.498 6448844.493, 665364.120 6...
21 AGRL Arable land POLYGON ((659696.375 6454079.995, 659655.429 6...
199 FRST Forests POLYGON ((655580.368 6459091.492, 655605.341 6...
130 AGRL Heterogeneous agricultural areas POLYGON ((656282.430 6459869.487, 656317.304 6...
20 AGRL Arable land POLYGON ((655302.115 6453709.695, 655281.748 6...
55 AGRL Arable land MULTIPOLYGON (((663877.511 6466895.588, 663738...
Now we have cleaned the data and have only those columns that we need for our analysis.
Join the layers
Now we are ready to perform the spatial join between the two layers that we have. The aim here is to get information about how many species sightings (of which species) happened in which landuse types? . Thus, we want to join attributes from the landuse layer we just modified into the already used and prepared monitoring GeoPackage file, category_3_species_porijogi.gpkg.
Read the category_3_species_porijogi.gpkg layer into memory
# protected species under class 3 monitoring sightings
In [14]: species_fp = "source/_static/data/L3/category_3_species_porijogi.gpkg"
# Read data
In [15]: species = gpd.read_file(species_fp, layer='category_3_species_porijogi', driver='GPKG')
Let’s make sure that the coordinate reference system of the layers are identical:
# Check the crs of landuse
In [16]: lulc.crs
Out[16]:
<Derived Projected CRS: EPSG:3301>
Name: Estonian Coordinate System of 1997
Axis Info [cartesian]:
- X[north]: Northing (metre)
- Y[east]: Easting (metre)
Area of Use:
- name: Estonia - onshore and offshore.
- bounds: (20.37, 57.52, 28.2, 60.0)
Coordinate Operation:
- name: Estonian National Grid
- method: Lambert Conic Conformal (2SP)
Datum: Estonia 1997
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich
# Check the crs of species layer in case we need to reproject the geometries to make them comparable
In [17]: species.crs
Out[17]:
<Derived Projected CRS: EPSG:3301>
Name: Estonian Coordinate System of 1997
Axis Info [cartesian]:
- X[north]: Northing (metre)
- Y[east]: Easting (metre)
Area of Use:
- name: Estonia - onshore and offshore.
- bounds: (20.37, 57.52, 28.2, 60.0)
Coordinate Operation:
- name: Estonian National Grid
- method: Lambert Conic Conformal (2SP)
Datum: Estonia 1997
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich
# Do they match? - We can test that
In [18]: lulc.crs == species.crs
Out[18]: True
They are identical. Thus, we can be sure that when doing spatial queries between layers the locations match and we get the right results e.g. from the spatial join that we are conducting here.
Let’s now join the attributes from lulc
(2) GeoDataFrame into
species
GeoDataFrame (1) by using gpd.sjoin()
-function
# Make a spatial join
In [19]: join = gpd.sjoin(species, lulc, how="inner", predicate="within")
# Let's check the result
In [20]: join.head()
Out[20]:
OBJECTID LIIK NIMI EXT_SYST_I KKR_KOOD ... LAADIMISKP geometry index_right Landuse LABEL2
126 144115 taimed III läikiv kurdsirbik -1324418359 KLO9400024 ... 2018-10-29 POINT (652641.854 6458099.088) 224 RNGB Scrub and/or herbaceous vegetation associations
132 144202 taimed III siberi võhumõõk 1344032932 KLO9312850 ... 2018-10-29 POINT (660367.911 6469599.134) 143 FRSD Forests
157 151696 taimed III balti sõrmkäpp -376320786 KLO9317690 ... 2018-10-29 POINT (656609.812 6459582.666) 201 FRST Forests
201 154300 taimed III harilik käoraamat -403691661 KLO9303449 ... 2018-10-29 POINT (657356.897 6459780.112) 201 FRST Forests
257 154692 taimed III suur käopõll 1784194494 KLO9303464 ... 2018-10-29 POINT (657291.854 6459783.092) 201 FRST Forests
[5 rows x 13 columns]
Awesome! Now we have performed a successful spatial join where we got
two new columns into our join
GeoDataFrame, i.e. index_right
that tells the index of the matching polygon in the lulc
layer and
species
.
Let’s save this layer into a new Shapefile:
# Output path
outfp = "source/_static/data/L3/landuse_per_species.shp"
# Save to disk
join.to_file(outfp)
Do the results make sense? Let’s evaluate this a bit by grouping and querying the
resulting join
for largest landuse type and species types combinations:
In [21]: join['NIMI'].value_counts()
Out[21]:
balti sõrmkäpp 20
valge-toonekurg 20
kahkjaspunane sõrmkäpp 17
suur käopõll 16
aas-karukell 6
soo-neiuvaip 5
vööthuul-sõrmkäpp 5
sulgjas õhik 2
harilik käoraamat 2
siberi võhumõõk 2
läikiv kurdsirbik 1
valgelaup-rabakiil 1
Veski 1
kodukakk 1
hink 1
vingerjas 1
kuradi-sõrmkäpp 1
ohakasoomukas 1
suur-kuldtiib 1
roomav öövilge 1
hall käpp 1
suur-rabakiil 1
Name: NIMI, dtype: int64
In [22]: join['LABEL2'].value_counts()
Out[22]:
Forests 52
Scrub and/or herbaceous vegetation associations 22
Heterogeneous agricultural areas 22
Pastures 4
Urban fabric 2
Inland waters 2
Arable land 2
Inland wetlands 1
Name: LABEL2, dtype: int64
# initialise empty list
In [23]: data_list = []
In [24]: for species_id, species_group in join.groupby('NIMI'):
....: lulc_count = species_group['LABEL2'].value_counts()
....: top = lulc_count.head(1)
....: data_list.append({'species_id':species_id, 'all_sights': len(species_group), 'top_lulc': top.index[0], 'sights_in_top': top[0]})
....:
# Creates DataFrame from now filled list wit hdata items.
In [25]: top_sights = pd.DataFrame(data_list)
# Print the data
In [26]: top_sights.sort_values(by=['all_sights','sights_in_top'], ascending=False).head(10)
Out[26]:
species_id all_sights top_lulc sights_in_top
18 valge-toonekurg 20 Heterogeneous agricultural areas 15
2 balti sõrmkäpp 20 Forests 12
6 kahkjaspunane sõrmkäpp 17 Forests 12
15 suur käopõll 16 Forests 14
1 aas-karukell 6 Forests 4
13 soo-neiuvaip 5 Scrub and/or herbaceous vegetation associations 3
21 vööthuul-sõrmkäpp 5 Forests 3
4 harilik käoraamat 2 Forests 1
12 siberi võhumõõk 2 Forests 1
14 sulgjas õhik 2 Scrub and/or herbaceous vegetation associations 1
Launch in the web/MyBinder: