Spatial join
A Spatial join is yet another classic GIS problem. Getting attributes from one layer and transferring them into another layer based on their spatial relationship is something you most likely need to do on a regular basis.
The previous materials focused on learning how to perform a Point in Polygon query.
We could now apply those techniques and create our
own function to perform a spatial join between two layers based on their
spatial relationship. We could for example join the attributes of a
polygon layer into a point layer where each point would get the
attributes of a polygon that contains
the point.
Luckily, spatial joins
(gpd.sjoin()
-function) is already implemented in Geopandas, thus we
do not need to create it ourselves. There are three possible types of
join that can be applied in spatial join that are determined with op
-parameter:
"intersects"
"within"
"contains"
Sounds familiar? Yep, all of those spatial relationships were discussed in the previous materials, thus you should know how they work.
Let’s perform a spatial join between the species monitoring data GeoPackage file, category_3_species_porijogi.gpkg and a Polygon layer that is extracted Corine Landuse Cover for the Porijogi region.
Download and clean the data
For this lesson we will be using publicly available Corine Landuse Cover for the Porijogi region in Tartumaa.
Unzip the file into into your L3 folder
porijogi_corine_landuse.dbf porijogi_corine_landuse.shp
porijogi_corine_landuse.prj porijogi_corine_landuse.shx
You should now have the files listed above in your Data folder.
Let’s read the data into geopandas and see what we have.
In [1]: import geopandas as gpd
# Filepath
In [2]: fp = "source/_static/data/L3/porijogi_corine_landuse.shp"
# Read the data
In [3]: lulc = gpd.read_file(fp)
# See the first rows
In [4]: lulc.head()
Out[4]:
code_12 id remark area_ha shape_area clc_int Landuse \
0 112 EU-2024407 None 67.055321 670553.20630 112 URML
1 112 EU-2024418 None 36.452500 364525.00295 112 URML
2 112 EU-2024426 None 33.525145 335251.45070 112 URML
3 112 EU-2024431 None 30.111925 301119.25420 112 URML
4 112 EU-2024434 None 70.979465 709794.64765 112 URML
geometry
0 POLYGON ((658854.791 6458244.203, 658826.936 6...
1 POLYGON ((663553.865 6459840.806, 663570.622 6...
2 POLYGON ((659006.349 6463680.667, 659031.241 6...
3 POLYGON ((658401.027 6466951.556, 658518.679 6...
4 POLYGON ((658348.310 6467489.990, 658040.310 6...
Okey so we have multiple columns in the dataset but the most important
one here is the column clc_int
(corine landuse code) that
tells the type of landuse cover under that polygon. Download the codes lookup table
In order to know what the codes mean, we will merge a lookup table to our spatial landuse cover GeoDataframe.
In [5]: import pandas as pd
In [6]: codes = pd.read_csv('source/_static/data/L3/corine_landuse_codes.csv', sep=';')
In [7]: codes
Out[7]:
CLC_CODE LABEL1 \
0 111 Artificial surfaces
1 112 Artificial surfaces
2 121 Artificial surfaces
3 122 Artificial surfaces
4 123 Artificial surfaces
5 124 Artificial surfaces
6 131 Artificial surfaces
7 132 Artificial surfaces
8 133 Artificial surfaces
9 141 Artificial surfaces
10 142 Artificial surfaces
11 211 Agricultural areas
12 222 Agricultural areas
13 231 Agricultural areas
14 242 Agricultural areas
15 243 Agricultural areas
16 311 Forest and semi natural areas
17 312 Forest and semi natural areas
18 313 Forest and semi natural areas
19 321 Forest and semi natural areas
20 322 Forest and semi natural areas
21 324 Forest and semi natural areas
22 331 Forest and semi natural areas
23 333 Forest and semi natural areas
24 411 Wetlands
25 412 Wetlands
26 421 Wetlands
27 511 Water bodies
28 512 Water bodies
29 521 Water bodies
30 523 Water bodies
LABEL2 \
0 Urban fabric
1 Urban fabric
2 Industrial, commercial and transport units
3 Industrial, commercial and transport units
4 Industrial, commercial and transport units
5 Industrial, commercial and transport units
6 Mine, dump and construction sites
7 Mine, dump and construction sites
8 Mine, dump and construction sites
9 Artificial, non-agricultural vegetated areas
10 Artificial, non-agricultural vegetated areas
11 Arable land
12 Permanent crops
13 Pastures
14 Heterogeneous agricultural areas
15 Heterogeneous agricultural areas
16 Forests
17 Forests
18 Forests
19 Scrub and/or herbaceous vegetation associations
20 Scrub and/or herbaceous vegetation associations
21 Scrub and/or herbaceous vegetation associations
22 Open spaces with little or no vegetation
23 Open spaces with little or no vegetation
24 Inland wetlands
25 Inland wetlands
26 Maritime wetlands
27 Inland waters
28 Inland waters
29 Marine waters
30 Marine waters
LABEL3 RGB
0 Continuous urban fabric 230-000-077
1 Discontinuous urban fabric 255-000-000
2 Industrial or commercial units 204-077-242
3 Road and rail networks and associated land 204-000-000
4 Port areas 230-204-204
5 Airports 230-204-230
6 Mineral extraction sites 166-000-204
7 Dump sites 166-077-000
8 Construction sites 255-077-255
9 Green urban areas 255-166-255
10 Sport and leisure facilities 255-230-255
11 Non-irrigated arable land 255-255-168
12 Fruit trees and berry plantations 242-166-077
13 Pastures 230-230-077
14 Complex cultivation patterns 255-230-077
15 Land principally occupied by agriculture, with... 230-204-077
16 Broad-leaved forest 128-255-000
17 Coniferous forest 000-166-000
18 Mixed forest 077-255-000
19 Natural grasslands 204-242-077
20 Moors and heathland 166-255-128
21 Transitional woodland-shrub 166-242-000
22 Beaches, dunes, sands 230-230-230
23 Sparsely vegetated areas 204-255-204
24 Inland marshes 166-166-255
25 Peat bogs 077-077-255
26 Salt marshes 204-204-255
27 Water courses 000-204-242
28 Water bodies 128-242-230
29 Coastal lagoons 000-255-166
30 Sea and ocean 230-242-255
This table contains a field CLC_CODE
which we use to connect the correct mapping to our landuse cover GeoDataframe, which has a field clc_int
.
We will now merge the lookup table codes
(a Pandas dataframe) into our lulc
GeoDataframe, based on the identifiers in the mentioned fields:
In [8]: lulc = lulc.merge(codes, left_on='clc_int', right_on='CLC_CODE')
In [9]: lulc.sample(10)
Out[9]:
code_12 id remark area_ha shape_area clc_int Landuse \
107 243 EU-2034187 None 42.361981 4.236198e+05 243 AGRL
39 211 EU-2026767 None 657.003251 6.570033e+06 211 AGRL
147 312 EU-2041029 None 43.920683 4.392068e+05 312 FRSE
58 231 EU-2029759 None 92.973231 9.297323e+05 231 PAST
116 243 EU-2034238 None 62.887154 6.288715e+05 243 AGRL
64 231 EU-2029878 None 256.324714 2.563247e+06 231 PAST
93 242 EU-2032606 None 99.110014 9.911001e+05 242 AGRL
179 313 EU-2046222 None 53.232284 5.323228e+05 313 FRST
171 313 EU-2046156 None 60.916253 6.091625e+05 313 FRST
99 243 EU-2034113 None 122.314246 1.223142e+06 243 AGRL
geometry CLC_CODE \
107 POLYGON ((669087.936 6449236.491, 669076.938 6... 243
39 POLYGON ((661203.930 6460226.991, 661289.184 6... 211
147 POLYGON ((666235.246 6446880.994, 666189.937 6... 312
58 POLYGON ((658221.047 6449814.551, 658231.796 6... 231
116 POLYGON ((660278.062 6452665.995, 660167.995 6... 243
64 POLYGON ((655454.919 6458495.026, 655094.679 6... 231
93 POLYGON ((664829.263 6465229.475, 664866.937 6... 242
179 POLYGON ((665337.868 6452005.635, 665346.810 6... 313
171 POLYGON ((663292.056 6449565.496, 663313.305 6... 313
99 POLYGON ((658469.924 6447204.593, 658475.492 6... 243
LABEL1 LABEL2 \
107 Agricultural areas Heterogeneous agricultural areas
39 Agricultural areas Arable land
147 Forest and semi natural areas Forests
58 Agricultural areas Pastures
116 Agricultural areas Heterogeneous agricultural areas
64 Agricultural areas Pastures
93 Agricultural areas Heterogeneous agricultural areas
179 Forest and semi natural areas Forests
171 Forest and semi natural areas Forests
99 Agricultural areas Heterogeneous agricultural areas
LABEL3 RGB
107 Land principally occupied by agriculture, with... 230-204-077
39 Non-irrigated arable land 255-255-168
147 Coniferous forest 000-166-000
58 Pastures 230-230-077
116 Land principally occupied by agriculture, with... 230-204-077
64 Pastures 230-230-077
93 Complex cultivation patterns 255-230-077
179 Mixed forest 077-255-000
171 Mixed forest 077-255-000
99 Land principally occupied by agriculture, with... 230-204-077
# See the column names and confirm that we now have a column called LABEL2, that gives us some textual description for the landuse codes
In [10]: lulc.columns
Out[10]:
Index(['code_12', 'id', 'remark', 'area_ha', 'shape_area', 'clc_int',
'Landuse', 'geometry', 'CLC_CODE', 'LABEL1', 'LABEL2', 'LABEL3', 'RGB'],
dtype='object')
Let’s also get rid of all unnecessary columns by selecting only columns that we need i.e.
Landuse
,LABEL2
andgeometry
# Columns that will be sected
In [11]: selected_cols = ['Landuse', 'LABEL2','geometry']
# Select those columns
In [12]: lulc = lulc[selected_cols]
# Let's see 10 randomly sampled rows
In [13]: lulc.sample(10)
Out[13]:
Landuse LABEL2 \
200 FRST Forests
83 AGRL Heterogeneous agricultural areas
51 AGRL Arable land
194 FRST Forests
15 AGRL Arable land
199 FRST Forests
169 FRST Forests
26 AGRL Arable land
103 AGRL Heterogeneous agricultural areas
44 AGRL Arable land
geometry
200 MULTIPOLYGON (((666229.712 6456136.660, 666214...
83 POLYGON ((665405.163 6457486.269, 665401.873 6...
51 POLYGON ((662998.119 6465195.489, 663004.989 6...
194 POLYGON ((665177.900 6456656.750, 665299.453 6...
15 POLYGON ((667346.056 6448925.436, 667282.868 6...
199 POLYGON ((655580.368 6459091.492, 655605.341 6...
169 MULTIPOLYGON (((650215.358 6450779.373, 650257...
26 POLYGON ((654851.749 6456870.487, 654760.741 6...
103 POLYGON ((666635.395 6447379.819, 666686.560 6...
44 POLYGON ((659091.304 6461255.987, 658975.936 6...
Now we have cleaned the data and have only those columns that we need for our analysis.
Join the layers
Now we are ready to perform the spatial join between the two layers that we have. The aim here is to get information about how many species sightings (of which species) happened in which landuse types? . Thus, we want to join attributes from the landuse layer we just modified into the already used and prepared monitoring GeoPackage file, category_3_species_porijogi.gpkg.
Read the category_3_species_porijogi.gpkg layer into memory
# protected species under class 3 monitoring sightings
In [14]: species_fp = "source/_static/data/L3/category_3_species_porijogi.gpkg"
# Read data
In [15]: species = gpd.read_file(species_fp, layer='category_3_species_porijogi', driver='GPKG')
Let’s make sure that the coordinate reference system of the layers are identical
# Check the crs of landuse
In [16]: lulc.crs
Out[16]:
<Projected CRS: EPSG:3301>
Name: Estonian Coordinate System of 1997
Axis Info [cartesian]:
- X[north]: Northing (metre)
- Y[east]: Easting (metre)
Area of Use:
- name: Estonia - onshore and offshore.
- bounds: (20.37, 57.52, 28.2, 60.0)
Coordinate Operation:
- name: Estonian National Grid
- method: Lambert Conic Conformal (2SP)
Datum: Estonia 1997
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich
# Check the crs of species layer in case we need to reproject the geometries to make them comparable
In [17]: species.crs
Out[17]:
<Projected CRS: EPSG:3301>
Name: Estonian Coordinate System of 1997
Axis Info [cartesian]:
- X[north]: Northing (metre)
- Y[east]: Easting (metre)
Area of Use:
- name: Estonia - onshore and offshore.
- bounds: (20.37, 57.52, 28.2, 60.0)
Coordinate Operation:
- name: Estonian National Grid
- method: Lambert Conic Conformal (2SP)
Datum: Estonia 1997
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich
# Do they match? - We can test that
In [18]: lulc.crs == species.crs
Out[18]: True
They are identical. Thus, we can be sure that when doing spatial queries between layers the locations match and we get the right results e.g. from the spatial join that we are conducting here.
Let’s now join the attributes from
lulc
(2) GeoDataFrame intospecies
GeoDataFrame (1) by usinggpd.sjoin()
-function
# Make a spatial join
In [19]: join = gpd.sjoin(species, lulc, how="inner", op="within")
# Let's check the result
In [20]: join.head()
Out[20]:
OBJECTID LIIK NIMI EXT_SYST_I KKR_KOOD \
126 144115 taimed III läikiv kurdsirbik -1324418359 KLO9400024
132 144202 taimed III siberi võhumõõk 1344032932 KLO9312850
157 151696 taimed III balti sõrmkäpp -376320786 KLO9317690
201 154300 taimed III harilik käoraamat -403691661 KLO9303449
257 154692 taimed III suur käopõll 1784194494 KLO9303464
PRIV_TYYP STAATUS IMPORT LAADIMISKP \
126 Avalik kontrollitud 0 2018-10-29
132 Avalik kontrollitud 0 2018-10-29
157 Avalik kontrollitud 0 2018-10-29
201 Avalik kontrollitud 0 2018-10-29
257 Avalik kontrollitud 0 2018-10-29
geometry index_right Landuse \
126 POINT (652641.854 6458099.088) 224 RNGB
132 POINT (660367.911 6469599.134) 143 FRSD
157 POINT (656609.812 6459582.666) 201 FRST
201 POINT (657356.897 6459780.112) 201 FRST
257 POINT (657291.854 6459783.092) 201 FRST
LABEL2
126 Scrub and/or herbaceous vegetation associations
132 Forests
157 Forests
201 Forests
257 Forests
Awesome! Now we have performed a successful spatial join where we got
two new columns into our join
GeoDataFrame, i.e. index_right
that tells the index of the matching polygon in the lulc
layer and
species
.
Let’s save this layer into a new Shapefile
# Output path
outfp = "source/_static/data/L3/landuse_per_species.shp"
# Save to disk
join.to_file(outfp)
Do the results make sense? Let’s evaluate this a bit by grouping and querying the
resulting join
for largest landuse type and species types combinations:
In [21]: join['NIMI'].value_counts()
Out[21]:
balti sõrmkäpp 20
valge-toonekurg 20
kahkjaspunane sõrmkäpp 17
suur käopõll 16
aas-karukell 6
soo-neiuvaip 5
vööthuul-sõrmkäpp 5
sulgjas õhik 2
harilik käoraamat 2
siberi võhumõõk 2
läikiv kurdsirbik 1
valgelaup-rabakiil 1
Veski 1
kodukakk 1
hink 1
vingerjas 1
kuradi-sõrmkäpp 1
ohakasoomukas 1
suur-kuldtiib 1
roomav öövilge 1
hall käpp 1
suur-rabakiil 1
Name: NIMI, dtype: int64
In [22]: join['LABEL2'].value_counts()
Out[22]:
Forests 52
Scrub and/or herbaceous vegetation associations 22
Heterogeneous agricultural areas 22
Pastures 4
Urban fabric 2
Inland waters 2
Arable land 2
Inland wetlands 1
Name: LABEL2, dtype: int64
# initialise empty list
In [23]: data_list = []
In [24]: for species_id, species_group in join.groupby('NIMI'):
....: lulc_count = species_group['LABEL2'].value_counts()
....: top = lulc_count.head(1)
....: data_list.append({'species_id':species_id, 'all_sights': len(species_group), 'top_lulc': top.index[0], 'sights_in_top': top[0]})
....:
# Creates DataFrame from now filled list wit hdata items.
In [25]: top_sights = pd.DataFrame(data_list)
# Print the data
In [26]: top_sights.sort_values(by=['all_sights','sights_in_top'], ascending=False).head(10)
Out[26]:
species_id all_sights \
18 valge-toonekurg 20
2 balti sõrmkäpp 20
6 kahkjaspunane sõrmkäpp 17
15 suur käopõll 16
1 aas-karukell 6
13 soo-neiuvaip 5
21 vööthuul-sõrmkäpp 5
4 harilik käoraamat 2
12 siberi võhumõõk 2
14 sulgjas õhik 2
top_lulc sights_in_top
18 Heterogeneous agricultural areas 15
2 Forests 12
6 Forests 12
15 Forests 14
1 Forests 4
13 Scrub and/or herbaceous vegetation associations 3
21 Forests 3
4 Forests 1
12 Forests 1
14 Scrub and/or herbaceous vegetation associations 1
Launch in the web/MyBinder: