Independent Exercise¶
This is an independent exercise for you to practice working with Python, DataFrames and geometry data. Results will be discussed in class but are not scored. In this exercise you need to do some data cleaning and pre-processing in order to create geometries in Geopandas.
Data¶
Download the following CSV file: https://opendata.smit.ee/ppa/csv/avalik_3.csv and save it into your working folder.
It is a dataset provided by the Information Technology and Development Center of the Ministry of the Interior of Estonia (SMIT). SMIT is the largest IT institution in the country, which creates and manages information systems necessary for saving lives and ensuring internal security. The data contains anonymised records of crimes, felonies and other offenses against the law. The data is binned into squares of either 500mx500m or 1000mx1000m. However, whenever we work with datasets we can expect issues and errors with the data and have to be able to get the valuable information out.
Create a Jupyter notebook and import some of our default libraries
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: import geopandas as gpd
In [4]: from shapely.geometry import box, Polygon
Sections¶
Problem 1: Load and parse CSV file to get useful columns
Problem 2: Polygons to map
Problem 3: Summarize damages per area
Problem 1: Parse CSV file to get useful columns¶
The main problem is that a) all fields are text/string fields and can’t be acted upon immediately. Usually we need numerical data for analytical and geographical purposes.
In order to read even the CSV into Pandas we need to employ some low level commands:
# read the CSV and strip whitespace
df = pd.read_table("avalik_3.csv").apply(lambda col: col.str.strip())
Now you have the dataframe to work with. Check things like ‘df.head(5)’, ‘df.columns’ and ‘df.dtypes’ to look at the data.
There are many interesting columns to analyse, like date and time of day of the offense, municipality, region or city, specific information about the type of offense, the damage caused and coordinate pairs. We will focus for now on the coordinate pairs Lest_X and Lest_Y as well as the damage caused Kahjusumma.
These fields are text/strings, and contain usually two numbers, linked with a dash/minus, e.g. ‘6589000-6589499’ or ‘0-499’. In Python you can split strings using the split-function.
# exemplary step to extract two coordinate values
In [5]: x_var = '6589000-6589499'
# we split based on the '-'
In [6]: split_list = x_var.split('-')
# split_list is a list, that holds all the elements that were split, it's length should be 2
In [7]: print(len(split_list))
2
# the content of the list is:
In [8]: print(split_list)
['6589000', '6589499']
# now we can access each element by index
In [9]: print(split_list[0])
6589000
In [10]: a = split_list[0]
In [11]: b = split_list[1]
Now these variables are technically still text/string objects. In order to extract and convert the number, which is maintained as a text object currently, we can try to cast into a numerical datatypes.
If we take a whole Pandas DataFrame column we can use the Pandas to_numeric function.
df['x_min'] = pd.to_numeric(df['x_min_text'], errors='coerce')
# or
df['x_min'] = df['x_min_text'].apply(pd.to_numeric, errors='coerce')
If you’d like to handle that on a per-element-level, you could do the cast/conversion in your function, using Python’s built-in cast functions int() or float():
# the elements from before ...
In [12]: print(type(split_list[0]))
<class 'str'>
In [13]: a = split_list[0]
In [14]: b = split_list[1]
In [15]: print(type(a))
<class 'str'>
# now using
In [16]: a_int = int(a)
In [17]: print(type(a_int))
<class 'int'>
Sometimes the data is not in a form or value that can be cast/converted in to numbers. In this case will get an Error, a so called Exception. In Python you can easily ask for forgiveness using a try: / except: block:
In [18]: good_text_int = "5"
In [19]: good_text_float = "5.5"
In [20]: not_a_number = "argl"
# all still text/string
In [21]: print(type(good_text_int))
<class 'str'>
In [22]: try:
....: x1 = int(good_text_int)
....: print(x1)
....: print(type(x1))
....: except:
....: print("error converting good_text_int")
....:
5
<class 'int'>
# this should work
In [23]: try:
....: x2 = float(good_text_float)
....: print(x2)
....: print(type(x2))
....: except:
....: print("error converting good_text_float")
....:
5.5
<class 'float'>
# this will fail
In [24]: try:
....: x3 = float(not_a_number)
....: print(x3)
....: print(type(x3))
....: except:
....: print("error converting not_a_number")
....:
error converting not_a_number
Ideally, you will define some functions, and apply them one by one to create additional columns where you eventually hold the numerical values.
Problem 2: Polygons to map¶
Now you have the minimum and maximum values for X-coordinates, for Y-coordinates and for the damage values.
For the geographic coordinates, you need to build the polygons.
As there are meant as squares and are only given as bounding box values, we can immediately use Shapely’s box function.
df['geometry'] = df.apply( "... function that creates box from the available numerical coordinate values" )
Then you can create a proper GeoDataFrame. The coordinates were provided in the Estonian National Grid system, EPSG:3301.
Problem 3: Summarize damages per area¶
currently, you have more than 40000 records/polygons. However, as you might or might have not realized, there are many that have the same geometry, thus, offense reported in the same area.
Find a meaningful to summarize the average damages per square.