Spatial join¶
A Spatial join is yet another classic GIS problem. Getting attributes from one layer and transferring them into another layer based on their spatial relationship is something you most likely need to do on a regular basis.
The previous materials focused on learning how to perform a Point in Polygon query.
We could now apply those techniques and create our
own function to perform a spatial join between two layers based on their
spatial relationship. We could for example join the attributes of a
polygon layer into a point layer where each point would get the
attributes of a polygon that contains
the point.
Luckily, spatial joins
(gpd.sjoin()
-function) is already implemented in Geopandas, thus we
do not need to create it ourselves. There are three possible types of
join that can be applied in spatial join that are determined with op
-parameter:
"intersects"
"within"
"contains"
Sounds familiar? Yep, all of those spatial relationships were discussed in the previous materials, thus you should know how they work.
Let’s perform a spatial join between the species monitoring data GeoPackage file, category_3_species_porijogi.gpkg and a Polygon layer that is extracted Corine Landuse Cover for the Porijogi region.
Download and clean the data¶
For this lesson we will be using publicly available Corine Landuse Cover for the Porijogi region in Tartumaa.
Unzip the file into into your L3 folder
porijogi_corine_landuse.dbf porijogi_corine_landuse.shp
porijogi_corine_landuse.prj porijogi_corine_landuse.shx
You should now have the files listed above in your Data folder.
Let’s read the data into geopandas and see what we have.
In [1]: import geopandas as gpd
# Filepath
In [2]: fp = "source/_static/data/L3/porijogi_corine_landuse.shp"
# Read the data
In [3]: lulc = gpd.read_file(fp)
# See the first rows
In [4]: lulc.head()
Out[4]:
code_12 id remark area_ha shape_area clc_int Landuse \
0 112 EU-2024407 None 67.055321 670553.20630 112 URML
1 112 EU-2024418 None 36.452500 364525.00295 112 URML
2 112 EU-2024426 None 33.525145 335251.45070 112 URML
3 112 EU-2024431 None 30.111925 301119.25420 112 URML
4 112 EU-2024434 None 70.979465 709794.64765 112 URML
geometry
0 POLYGON ((658854.791 6458244.203, 658826.936 6...
1 POLYGON ((663553.865 6459840.806, 663570.622 6...
2 POLYGON ((659006.349 6463680.667, 659031.241 6...
3 POLYGON ((658401.027 6466951.556, 658518.679 6...
4 POLYGON ((658348.310 6467489.990, 658040.310 6...
Okey so we have multiple columns in the dataset but the most important
one here is the column clc_int
(corine landuse code) that
tells the type of landuse cover under that polygon. Download the codes lookup table
In order to know what the codes mean, we will merge a lookup table to our spatial landuse cover GeoDataframe.
In [5]: import pandas as pd
In [6]: codes = pd.read_csv('source/_static/data/L3/corine_landuse_codes.csv', sep=';')
In [7]: codes
Out[7]:
CLC_CODE LABEL1 \
0 111 Artificial surfaces
1 112 Artificial surfaces
2 121 Artificial surfaces
3 122 Artificial surfaces
4 123 Artificial surfaces
5 124 Artificial surfaces
6 131 Artificial surfaces
7 132 Artificial surfaces
8 133 Artificial surfaces
9 141 Artificial surfaces
10 142 Artificial surfaces
11 211 Agricultural areas
12 222 Agricultural areas
13 231 Agricultural areas
14 242 Agricultural areas
15 243 Agricultural areas
16 311 Forest and semi natural areas
17 312 Forest and semi natural areas
18 313 Forest and semi natural areas
19 321 Forest and semi natural areas
20 322 Forest and semi natural areas
21 324 Forest and semi natural areas
22 331 Forest and semi natural areas
23 333 Forest and semi natural areas
24 411 Wetlands
25 412 Wetlands
26 421 Wetlands
27 511 Water bodies
28 512 Water bodies
29 521 Water bodies
30 523 Water bodies
LABEL2 \
0 Urban fabric
1 Urban fabric
2 Industrial, commercial and transport units
3 Industrial, commercial and transport units
4 Industrial, commercial and transport units
5 Industrial, commercial and transport units
6 Mine, dump and construction sites
7 Mine, dump and construction sites
8 Mine, dump and construction sites
9 Artificial, non-agricultural vegetated areas
10 Artificial, non-agricultural vegetated areas
11 Arable land
12 Permanent crops
13 Pastures
14 Heterogeneous agricultural areas
15 Heterogeneous agricultural areas
16 Forests
17 Forests
18 Forests
19 Scrub and/or herbaceous vegetation associations
20 Scrub and/or herbaceous vegetation associations
21 Scrub and/or herbaceous vegetation associations
22 Open spaces with little or no vegetation
23 Open spaces with little or no vegetation
24 Inland wetlands
25 Inland wetlands
26 Maritime wetlands
27 Inland waters
28 Inland waters
29 Marine waters
30 Marine waters
LABEL3 RGB
0 Continuous urban fabric 230-000-077
1 Discontinuous urban fabric 255-000-000
2 Industrial or commercial units 204-077-242
3 Road and rail networks and associated land 204-000-000
4 Port areas 230-204-204
5 Airports 230-204-230
6 Mineral extraction sites 166-000-204
7 Dump sites 166-077-000
8 Construction sites 255-077-255
9 Green urban areas 255-166-255
10 Sport and leisure facilities 255-230-255
11 Non-irrigated arable land 255-255-168
12 Fruit trees and berry plantations 242-166-077
13 Pastures 230-230-077
14 Complex cultivation patterns 255-230-077
15 Land principally occupied by agriculture, with... 230-204-077
16 Broad-leaved forest 128-255-000
17 Coniferous forest 000-166-000
18 Mixed forest 077-255-000
19 Natural grasslands 204-242-077
20 Moors and heathland 166-255-128
21 Transitional woodland-shrub 166-242-000
22 Beaches, dunes, sands 230-230-230
23 Sparsely vegetated areas 204-255-204
24 Inland marshes 166-166-255
25 Peat bogs 077-077-255
26 Salt marshes 204-204-255
27 Water courses 000-204-242
28 Water bodies 128-242-230
29 Coastal lagoons 000-255-166
30 Sea and ocean 230-242-255
This table contains a field CLC_CODE
which we use to connect the correct mapping to our landuse cover GeoDataframe, which has a field clc_int
.
We will now merge the lookup table codes
(a Pandas dataframe) into our lulc
GeoDataframe, based on the identifiers in the mentioned fields:
In [8]: lulc = lulc.merge(codes, left_on='clc_int', right_on='CLC_CODE')
In [9]: lulc.sample(10)
Out[9]:
code_12 id remark area_ha shape_area clc_int Landuse \
97 242 EU-2032632 None 320.561233 3.205612e+06 242 AGRL
242 511 EU-2057279 None 680.904882 6.809048e+06 511 WATR
176 313 EU-2046188 None 1361.557146 1.361557e+07 313 FRST
123 243 EU-2034298 None 144.695370 1.446954e+06 243 AGRL
131 243 EU-2034362 None 137.819209 1.378192e+06 243 AGRL
204 313 EU-2046436 None 178.733078 1.787331e+06 313 FRST
34 211 EU-2026736 None 64.684476 6.468448e+05 211 AGRL
206 321 EU-2050766 None 158.484669 1.584847e+06 321 RNGE
221 324 EU-2052075 None 27.506101 2.750610e+05 324 RNGB
87 242 EU-2032505 None 72.501812 7.250181e+05 242 AGRL
geometry CLC_CODE \
97 MULTIPOLYGON (((658432.374 6469848.151, 658267... 242
242 POLYGON ((662550.899 6469623.451, 662525.245 6... 511
176 POLYGON ((650371.435 6452851.995, 650381.372 6... 313
123 POLYGON ((661923.007 6455403.736, 662145.620 6... 243
131 POLYGON ((659379.808 6459999.493, 659243.178 6... 243
204 POLYGON ((661448.216 6462739.348, 661435.432 6... 313
34 POLYGON ((661942.121 6457768.994, 661968.303 6... 211
206 POLYGON ((662183.682 6460761.495, 662185.931 6... 321
221 POLYGON ((652025.110 6457934.129, 652349.760 6... 324
87 POLYGON ((657763.169 6460603.043, 657761.623 6... 242
LABEL1 \
97 Agricultural areas
242 Water bodies
176 Forest and semi natural areas
123 Agricultural areas
131 Agricultural areas
204 Forest and semi natural areas
34 Agricultural areas
206 Forest and semi natural areas
221 Forest and semi natural areas
87 Agricultural areas
LABEL2 \
97 Heterogeneous agricultural areas
242 Inland waters
176 Forests
123 Heterogeneous agricultural areas
131 Heterogeneous agricultural areas
204 Forests
34 Arable land
206 Scrub and/or herbaceous vegetation associations
221 Scrub and/or herbaceous vegetation associations
87 Heterogeneous agricultural areas
LABEL3 RGB
97 Complex cultivation patterns 255-230-077
242 Water courses 000-204-242
176 Mixed forest 077-255-000
123 Land principally occupied by agriculture, with... 230-204-077
131 Land principally occupied by agriculture, with... 230-204-077
204 Mixed forest 077-255-000
34 Non-irrigated arable land 255-255-168
206 Natural grasslands 204-242-077
221 Transitional woodland-shrub 166-242-000
87 Complex cultivation patterns 255-230-077
# See the column names and confirm that we now have a column called LABEL2, that gives us some textual description for the landuse codes
In [10]: lulc.columns
Out[10]:
Index(['code_12', 'id', 'remark', 'area_ha', 'shape_area', 'clc_int',
'Landuse', 'geometry', 'CLC_CODE', 'LABEL1', 'LABEL2', 'LABEL3', 'RGB'],
dtype='object')
Let’s also get rid of all unnecessary columns by selecting only columns that we need i.e.
Landuse
,LABEL2
andgeometry
# Columns that will be sected
In [11]: selected_cols = ['Landuse', 'LABEL2','geometry']
# Select those columns
In [12]: lulc = lulc[selected_cols]
# Let's see 10 randomly sampled rows
In [13]: lulc.sample(10)
Out[13]:
Landuse LABEL2 \
210 RNGB Scrub and/or herbaceous vegetation associations
176 FRST Forests
71 AGRL Heterogeneous agricultural areas
127 AGRL Heterogeneous agricultural areas
213 RNGB Scrub and/or herbaceous vegetation associations
120 AGRL Heterogeneous agricultural areas
156 FRSE Forests
175 FRST Forests
140 FRSD Forests
155 FRSE Forests
geometry
210 POLYGON ((653587.374 6450466.487, 653427.242 6...
176 POLYGON ((650371.435 6452851.995, 650381.372 6...
71 POLYGON ((659371.432 6449378.989, 659096.054 6...
127 POLYGON ((661595.757 6457878.711, 661567.866 6...
213 POLYGON ((664367.934 6450738.987, 664230.617 6...
120 POLYGON ((655066.992 6455944.494, 655058.496 6...
156 POLYGON ((666014.039 6456928.579, 666009.484 6...
175 POLYGON ((659371.432 6449378.989, 659362.117 6...
140 POLYGON ((664529.993 6449567.988, 664562.435 6...
155 POLYGON ((663003.119 6457774.497, 663041.741 6...
Now we have cleaned the data and have only those columns that we need for our analysis.
Join the layers¶
Now we are ready to perform the spatial join between the two layers that we have. The aim here is to get information about how many species sightings (of which species) happened in which landuse types? . Thus, we want to join attributes from the landuse layer we just modified into the already used and prepared monitoring GeoPackage file, category_3_species_porijogi.gpkg.
Read the category_3_species_porijogi.gpkg layer into memory
# protected species under class 3 monitoring sightings
In [14]: species_fp = "source/_static/data/L3/category_3_species_porijogi.gpkg"
# Read data
In [15]: species = gpd.read_file(species_fp, layer='category_3_species_porijogi', driver='GPKG')
Let’s make sure that the coordinate reference system of the layers are identical
# Check the crs of landuse In [16]: lulc.crs Out[16]: {'init': 'epsg:3301'} # Check the crs of species layer in case we need to reproject the geometries to make them comparable In [17]: species.crs Out[17]: {'init': 'epsg:3301'} # Do they match? - We can test that In [18]: lulc.crs == species.crs Out[18]: True
They are identical. Thus, we can be sure that when doing spatial queries between layers the locations match and we get the right results e.g. from the spatial join that we are conducting here.
Let’s now join the attributes from
lulc
(2) GeoDataFrame intospecies
GeoDataFrame (1) by usinggpd.sjoin()
-function
# Make a spatial join
In [19]: join = gpd.sjoin(species, lulc, how="inner", op="within")
# Let's check the result
In [20]: join.head()
Out[20]:
OBJECTID LIIK NIMI EXT_SYST_I KKR_KOOD \
126 144115 taimed III läikiv kurdsirbik -1324418359 KLO9400024
132 144202 taimed III siberi võhumõõk 1344032932 KLO9312850
157 151696 taimed III balti sõrmkäpp -376320786 KLO9317690
201 154300 taimed III harilik käoraamat -403691661 KLO9303449
257 154692 taimed III suur käopõll 1784194494 KLO9303464
PRIV_TYYP STAATUS IMPORT LAADIMISKP \
126 Avalik kontrollitud 0 2018-10-29
132 Avalik kontrollitud 0 2018-10-29
157 Avalik kontrollitud 0 2018-10-29
201 Avalik kontrollitud 0 2018-10-29
257 Avalik kontrollitud 0 2018-10-29
geometry index_right Landuse \
126 POINT (652641.854 6458099.088) 224 RNGB
132 POINT (660367.911 6469599.134) 143 FRSD
157 POINT (656609.812 6459582.666) 201 FRST
201 POINT (657356.897 6459780.112) 201 FRST
257 POINT (657291.854 6459783.092) 201 FRST
LABEL2
126 Scrub and/or herbaceous vegetation associations
132 Forests
157 Forests
201 Forests
257 Forests
Awesome! Now we have performed a successful spatial join where we got
two new columns into our join
GeoDataFrame, i.e. index_right
that tells the index of the matching polygon in the lulc
layer and
species
.
Let’s save this layer into a new Shapefile
# Output path
outfp = "source/_static/data/L3/landuse_per_species.shp"
# Save to disk
join.to_file(outfp)
Do the results make sense? Let’s evaluate this a bit by grouping and querying the
resulting join
for largest landuse type and species types combinations:
In [21]: join['NIMI'].value_counts()
Out[21]:
balti sõrmkäpp 20
valge-toonekurg 20
kahkjaspunane sõrmkäpp 17
suur käopõll 16
aas-karukell 6
soo-neiuvaip 5
vööthuul-sõrmkäpp 5
siberi võhumõõk 2
sulgjas õhik 2
harilik käoraamat 2
suur-rabakiil 1
kodukakk 1
vingerjas 1
valgelaup-rabakiil 1
hink 1
Veski 1
läikiv kurdsirbik 1
hall käpp 1
kuradi-sõrmkäpp 1
ohakasoomukas 1
roomav öövilge 1
suur-kuldtiib 1
Name: NIMI, dtype: int64
In [22]: join['LABEL2'].value_counts()
Out[22]:
Forests 52
Heterogeneous agricultural areas 22
Scrub and/or herbaceous vegetation associations 22
Pastures 4
Inland waters 2
Urban fabric 2
Arable land 2
Inland wetlands 1
Name: LABEL2, dtype: int64
# initialise empty list
In [23]: data_list = []
In [24]: for species_id, species_group in join.groupby('NIMI'):
....: lulc_count = species_group['LABEL2'].value_counts()
....: top = lulc_count.head(1)
....: data_list.append({'species_id':species_id, 'all_sights': len(species_group), 'top_lulc': top.index[0], 'sights_in_top': top[0]})
....:
# Creates DataFrame from now filled list wit hdata items.
In [25]: top_sights = pd.DataFrame(data_list)
# Print the data
In [26]: top_sights.sort_values(by=['all_sights','sights_in_top'], ascending=False).head(10)
Out[26]:
species_id all_sights \
18 valge-toonekurg 20
2 balti sõrmkäpp 20
6 kahkjaspunane sõrmkäpp 17
15 suur käopõll 16
1 aas-karukell 6
13 soo-neiuvaip 5
21 vööthuul-sõrmkäpp 5
4 harilik käoraamat 2
12 siberi võhumõõk 2
14 sulgjas õhik 2
top_lulc sights_in_top
18 Heterogeneous agricultural areas 15
2 Forests 12
6 Forests 12
15 Forests 14
1 Forests 4
13 Scrub and/or herbaceous vegetation associations 3
21 Forests 3
4 Scrub and/or herbaceous vegetation associations 1
12 Heterogeneous agricultural areas 1
14 Scrub and/or herbaceous vegetation associations 1
Launch in the web/MyBinder: