Spatial join¶
A Spatial join is yet another classic GIS problem. Getting attributes from one layer and transferring them into another layer based on their spatial relationship is something you most likely need to do on a regular basis.
The previous materials focused on learning how to perform a Point in Polygon query.
We could now apply those techniques and create our
own function to perform a spatial join between two layers based on their
spatial relationship. We could for example join the attributes of a
polygon layer into a point layer where each point would get the
attributes of a polygon that contains
the point.
Luckily, spatial joins
(gpd.sjoin()
-function) is already implemented in Geopandas, thus we
do not need to create it ourselves. There are three possible types of
join that can be applied in spatial join that are determined with op
-parameter:
"intersects"
"within"
"contains"
Sounds familiar? Yep, all of those spatial relationships were discussed in the previous materials, thus you should know how they work.
Let’s perform a spatial join between the species monitoring data GeoPackage file, category_3_species_porijogi.gpkg and a Polygon layer that is extracted Corine Landuse Cover for the Porijogi region.
Download and clean the data¶
For this lesson we will be using publicly available Corine Landuse Cover for the Porijogi region in Tartumaa.
Unzip the file into into your L3 folder
porijogi_corine_landuse.dbf porijogi_corine_landuse.shp
porijogi_corine_landuse.prj porijogi_corine_landuse.shx
You should now have the files listed above in your Data folder.
Let’s read the data into geopandas and see what we have.
In [1]: import geopandas as gpd
# Filepath
In [2]: fp = "source/_static/data/L3/porijogi_corine_landuse.shp"
# Read the data
In [3]: lulc = gpd.read_file(fp)
# See the first rows
In [4]: lulc.head()
Out[4]:
code_12 id remark area_ha shape_area clc_int Landuse \
0 112 EU-2024407 None 67.055321 670553.20630 112 URML
1 112 EU-2024418 None 36.452500 364525.00295 112 URML
2 112 EU-2024426 None 33.525145 335251.45070 112 URML
3 112 EU-2024431 None 30.111925 301119.25420 112 URML
4 112 EU-2024434 None 70.979465 709794.64765 112 URML
geometry
0 POLYGON ((658854.791 6458244.203, 658826.936 6...
1 POLYGON ((663553.865 6459840.806, 663570.622 6...
2 POLYGON ((659006.349 6463680.667, 659031.241 6...
3 POLYGON ((658401.027 6466951.556, 658518.679 6...
4 POLYGON ((658348.310 6467489.990, 658040.310 6...
Okey so we have multiple columns in the dataset but the most important
one here is the column clc_int
(corine landuse code) that
tells the type of landuse cover under that polygon. Download the codes lookup table
In order to know what the codes mean, we will merge a lookup table to our spatial landuse cover GeoDataframe.
In [5]: import pandas as pd
In [6]: codes = pd.read_csv('source/_static/data/L3/corine_landuse_codes.csv', sep=';')
In [7]: codes
Out[7]:
CLC_CODE LABEL1 \
0 111 Artificial surfaces
1 112 Artificial surfaces
2 121 Artificial surfaces
3 122 Artificial surfaces
4 123 Artificial surfaces
5 124 Artificial surfaces
6 131 Artificial surfaces
7 132 Artificial surfaces
8 133 Artificial surfaces
9 141 Artificial surfaces
10 142 Artificial surfaces
11 211 Agricultural areas
12 222 Agricultural areas
13 231 Agricultural areas
14 242 Agricultural areas
15 243 Agricultural areas
16 311 Forest and semi natural areas
17 312 Forest and semi natural areas
18 313 Forest and semi natural areas
19 321 Forest and semi natural areas
20 322 Forest and semi natural areas
21 324 Forest and semi natural areas
22 331 Forest and semi natural areas
23 333 Forest and semi natural areas
24 411 Wetlands
25 412 Wetlands
26 421 Wetlands
27 511 Water bodies
28 512 Water bodies
29 521 Water bodies
30 523 Water bodies
LABEL2 \
0 Urban fabric
1 Urban fabric
2 Industrial, commercial and transport units
3 Industrial, commercial and transport units
4 Industrial, commercial and transport units
5 Industrial, commercial and transport units
6 Mine, dump and construction sites
7 Mine, dump and construction sites
8 Mine, dump and construction sites
9 Artificial, non-agricultural vegetated areas
10 Artificial, non-agricultural vegetated areas
11 Arable land
12 Permanent crops
13 Pastures
14 Heterogeneous agricultural areas
15 Heterogeneous agricultural areas
16 Forests
17 Forests
18 Forests
19 Scrub and/or herbaceous vegetation associations
20 Scrub and/or herbaceous vegetation associations
21 Scrub and/or herbaceous vegetation associations
22 Open spaces with little or no vegetation
23 Open spaces with little or no vegetation
24 Inland wetlands
25 Inland wetlands
26 Maritime wetlands
27 Inland waters
28 Inland waters
29 Marine waters
30 Marine waters
LABEL3 RGB
0 Continuous urban fabric 230-000-077
1 Discontinuous urban fabric 255-000-000
2 Industrial or commercial units 204-077-242
3 Road and rail networks and associated land 204-000-000
4 Port areas 230-204-204
5 Airports 230-204-230
6 Mineral extraction sites 166-000-204
7 Dump sites 166-077-000
8 Construction sites 255-077-255
9 Green urban areas 255-166-255
10 Sport and leisure facilities 255-230-255
11 Non-irrigated arable land 255-255-168
12 Fruit trees and berry plantations 242-166-077
13 Pastures 230-230-077
14 Complex cultivation patterns 255-230-077
15 Land principally occupied by agriculture, with... 230-204-077
16 Broad-leaved forest 128-255-000
17 Coniferous forest 000-166-000
18 Mixed forest 077-255-000
19 Natural grasslands 204-242-077
20 Moors and heathland 166-255-128
21 Transitional woodland-shrub 166-242-000
22 Beaches, dunes, sands 230-230-230
23 Sparsely vegetated areas 204-255-204
24 Inland marshes 166-166-255
25 Peat bogs 077-077-255
26 Salt marshes 204-204-255
27 Water courses 000-204-242
28 Water bodies 128-242-230
29 Coastal lagoons 000-255-166
30 Sea and ocean 230-242-255
This table contains a field CLC_CODE
which we use to connect the correct mapping to our landuse cover GeoDataframe, which has a field clc_int
.
We will now merge the lookup table codes
(a Pandas dataframe) into our lulc
GeoDataframe, based on the identifiers in the mentioned fields:
In [8]: lulc = lulc.merge(codes, left_on='clc_int', right_on='CLC_CODE')
In [9]: lulc.sample(10)
Out[9]:
code_12 id remark area_ha shape_area clc_int Landuse \
30 211 EU-2026721 None 51.589110 5.158911e+05 211 AGRL
204 313 EU-2046436 None 178.733078 1.787331e+06 313 FRST
46 211 EU-2026822 None 37.432131 3.743213e+05 211 AGRL
16 211 EU-2026633 None 34.944840 3.494484e+05 211 AGRL
57 231 EU-2029756 None 57.420560 5.742056e+05 231 PAST
118 243 EU-2034261 None 26.894219 2.689422e+05 243 AGRL
108 243 EU-2034191 None 1413.048177 1.413048e+07 243 AGRL
69 231 EU-2030017 None 785.112680 7.851127e+06 231 PAST
116 243 EU-2034238 None 62.887154 6.288715e+05 243 AGRL
154 312 EU-2041206 None 90.963689 9.096369e+05 312 FRSE
geometry CLC_CODE \
30 POLYGON ((664580.232 6456124.995, 664577.116 6... 211
204 POLYGON ((661448.216 6462739.348, 661435.432 6... 313
46 POLYGON ((662124.874 6462823.495, 662220.245 6... 211
16 POLYGON ((668192.805 6450263.495, 668179.121 6... 211
57 POLYGON ((662912.623 6448033.995, 662855.868 6... 231
118 POLYGON ((663042.054 6453857.496, 662837.556 6... 243
108 MULTIPOLYGON (((654082.681 6452253.489, 654077... 243
69 POLYGON ((662568.340 6468060.656, 662619.927 6... 231
116 POLYGON ((660278.062 6452665.995, 660167.995 6... 243
154 POLYGON ((657220.436 6458048.495, 657254.620 6... 312
LABEL1 LABEL2 \
30 Agricultural areas Arable land
204 Forest and semi natural areas Forests
46 Agricultural areas Arable land
16 Agricultural areas Arable land
57 Agricultural areas Pastures
118 Agricultural areas Heterogeneous agricultural areas
108 Agricultural areas Heterogeneous agricultural areas
69 Agricultural areas Pastures
116 Agricultural areas Heterogeneous agricultural areas
154 Forest and semi natural areas Forests
LABEL3 RGB
30 Non-irrigated arable land 255-255-168
204 Mixed forest 077-255-000
46 Non-irrigated arable land 255-255-168
16 Non-irrigated arable land 255-255-168
57 Pastures 230-230-077
118 Land principally occupied by agriculture, with... 230-204-077
108 Land principally occupied by agriculture, with... 230-204-077
69 Pastures 230-230-077
116 Land principally occupied by agriculture, with... 230-204-077
154 Coniferous forest 000-166-000
# See the column names and confirm that we now have a column called LABEL2, that gives us some textual description for the landuse codes
In [10]: lulc.columns
Out[10]:
Index(['code_12', 'id', 'remark', 'area_ha', 'shape_area', 'clc_int',
'Landuse', 'geometry', 'CLC_CODE', 'LABEL1', 'LABEL2', 'LABEL3', 'RGB'],
dtype='object')
Let’s also get rid of all unnecessary columns by selecting only columns that we need i.e.
Landuse
,LABEL2
andgeometry
# Columns that will be sected
In [11]: selected_cols = ['Landuse', 'LABEL2','geometry']
# Select those columns
In [12]: lulc = lulc[selected_cols]
# Let's see 10 randomly sampled rows
In [13]: lulc.sample(10)
Out[13]:
Landuse LABEL2 \
80 AGRL Heterogeneous agricultural areas
241 WETL Inland wetlands
130 AGRL Heterogeneous agricultural areas
65 PAST Pastures
21 AGRL Arable land
49 AGRL Arable land
55 AGRL Arable land
4 URML Urban fabric
161 FRSE Forests
230 RNGB Scrub and/or herbaceous vegetation associations
geometry
80 POLYGON ((661999.553 6456694.993, 662158.623 6...
241 POLYGON ((660625.437 6469578.988, 660637.496 6...
130 POLYGON ((656282.430 6459869.487, 656317.304 6...
65 POLYGON ((658319.810 6460064.491, 658182.936 6...
21 POLYGON ((659696.375 6454079.995, 659655.429 6...
49 POLYGON ((664101.431 6464658.061, 664168.493 6...
55 MULTIPOLYGON (((663877.511 6466895.588, 663738...
4 POLYGON ((658348.310 6467489.990, 658040.310 6...
161 POLYGON ((654532.876 6461046.171, 654635.125 6...
230 POLYGON ((662872.310 6461090.487, 662990.991 6...
Now we have cleaned the data and have only those columns that we need for our analysis.
Join the layers¶
Now we are ready to perform the spatial join between the two layers that we have. The aim here is to get information about how many species sightings (of which species) happened in which landuse types? . Thus, we want to join attributes from the landuse layer we just modified into the already used and prepared monitoring GeoPackage file, category_3_species_porijogi.gpkg.
Read the category_3_species_porijogi.gpkg layer into memory
# protected species under class 3 monitoring sightings
In [14]: species_fp = "source/_static/data/L3/category_3_species_porijogi.gpkg"
# Read data
In [15]: species = gpd.read_file(species_fp, layer='category_3_species_porijogi', driver='GPKG')
Let’s make sure that the coordinate reference system of the layers are identical
# Check the crs of landuse
In [16]: lulc.crs
Out[16]:
<Projected CRS: EPSG:3301>
Name: Estonian Coordinate System of 1997
Axis Info [cartesian]:
- X[north]: Northing (metre)
- Y[east]: Easting (metre)
Area of Use:
- name: Estonia
- bounds: (20.37, 57.52, 28.2, 60.0)
Coordinate Operation:
- name: Estonian National Grid
- method: Lambert Conic Conformal (2SP)
Datum: Estonia 1997
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich
# Check the crs of species layer in case we need to reproject the geometries to make them comparable
In [17]: species.crs
Out[17]:
<Projected CRS: EPSG:3301>
Name: Estonian Coordinate System of 1997
Axis Info [cartesian]:
- X[north]: Northing (metre)
- Y[east]: Easting (metre)
Area of Use:
- name: Estonia
- bounds: (20.37, 57.52, 28.2, 60.0)
Coordinate Operation:
- name: Estonian National Grid
- method: Lambert Conic Conformal (2SP)
Datum: Estonia 1997
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich
# Do they match? - We can test that
In [18]: lulc.crs == species.crs
Out[18]: True
They are identical. Thus, we can be sure that when doing spatial queries between layers the locations match and we get the right results e.g. from the spatial join that we are conducting here.
Let’s now join the attributes from
lulc
(2) GeoDataFrame intospecies
GeoDataFrame (1) by usinggpd.sjoin()
-function
# Make a spatial join
In [19]: join = gpd.sjoin(species, lulc, how="inner", op="within")
# Let's check the result
In [20]: join.head()
Out[20]:
OBJECTID LIIK NIMI EXT_SYST_I KKR_KOOD \
126 144115 taimed III läikiv kurdsirbik -1324418359 KLO9400024
132 144202 taimed III siberi võhumõõk 1344032932 KLO9312850
157 151696 taimed III balti sõrmkäpp -376320786 KLO9317690
201 154300 taimed III harilik käoraamat -403691661 KLO9303449
257 154692 taimed III suur käopõll 1784194494 KLO9303464
PRIV_TYYP STAATUS IMPORT LAADIMISKP \
126 Avalik kontrollitud 0 2018-10-29
132 Avalik kontrollitud 0 2018-10-29
157 Avalik kontrollitud 0 2018-10-29
201 Avalik kontrollitud 0 2018-10-29
257 Avalik kontrollitud 0 2018-10-29
geometry index_right Landuse \
126 POINT (652641.854 6458099.088) 224 RNGB
132 POINT (660367.911 6469599.134) 143 FRSD
157 POINT (656609.812 6459582.666) 201 FRST
201 POINT (657356.897 6459780.112) 201 FRST
257 POINT (657291.854 6459783.092) 201 FRST
LABEL2
126 Scrub and/or herbaceous vegetation associations
132 Forests
157 Forests
201 Forests
257 Forests
Awesome! Now we have performed a successful spatial join where we got
two new columns into our join
GeoDataFrame, i.e. index_right
that tells the index of the matching polygon in the lulc
layer and
species
.
Let’s save this layer into a new Shapefile
# Output path
outfp = "source/_static/data/L3/landuse_per_species.shp"
# Save to disk
join.to_file(outfp)
Do the results make sense? Let’s evaluate this a bit by grouping and querying the
resulting join
for largest landuse type and species types combinations:
In [21]: join['NIMI'].value_counts()
Out[21]:
balti sõrmkäpp 20
valge-toonekurg 20
kahkjaspunane sõrmkäpp 17
suur käopõll 16
aas-karukell 6
vööthuul-sõrmkäpp 5
soo-neiuvaip 5
sulgjas õhik 2
siberi võhumõõk 2
harilik käoraamat 2
läikiv kurdsirbik 1
suur-kuldtiib 1
suur-rabakiil 1
roomav öövilge 1
kuradi-sõrmkäpp 1
hink 1
Veski 1
valgelaup-rabakiil 1
ohakasoomukas 1
kodukakk 1
hall käpp 1
vingerjas 1
Name: NIMI, dtype: int64
In [22]: join['LABEL2'].value_counts()
Out[22]:
Forests 52
Heterogeneous agricultural areas 22
Scrub and/or herbaceous vegetation associations 22
Pastures 4
Urban fabric 2
Inland waters 2
Arable land 2
Inland wetlands 1
Name: LABEL2, dtype: int64
# initialise empty list
In [23]: data_list = []
In [24]: for species_id, species_group in join.groupby('NIMI'):
....: lulc_count = species_group['LABEL2'].value_counts()
....: top = lulc_count.head(1)
....: data_list.append({'species_id':species_id, 'all_sights': len(species_group), 'top_lulc': top.index[0], 'sights_in_top': top[0]})
....:
# Creates DataFrame from now filled list wit hdata items.
In [25]: top_sights = pd.DataFrame(data_list)
# Print the data
In [26]: top_sights.sort_values(by=['all_sights','sights_in_top'], ascending=False).head(10)
Out[26]:
species_id all_sights \
18 valge-toonekurg 20
2 balti sõrmkäpp 20
6 kahkjaspunane sõrmkäpp 17
15 suur käopõll 16
1 aas-karukell 6
13 soo-neiuvaip 5
21 vööthuul-sõrmkäpp 5
4 harilik käoraamat 2
12 siberi võhumõõk 2
14 sulgjas õhik 2
top_lulc sights_in_top
18 Heterogeneous agricultural areas 15
2 Forests 12
6 Forests 12
15 Forests 14
1 Forests 4
13 Scrub and/or herbaceous vegetation associations 3
21 Forests 3
4 Scrub and/or herbaceous vegetation associations 1
12 Heterogeneous agricultural areas 1
14 Scrub and/or herbaceous vegetation associations 1
Launch in the web/MyBinder: