Independent Exercise

This is an independent exercise for you to practice working with Python, DataFrames and geometry data. Results will be discussed in class but are not scored. In this exercise you need to do some data cleaning and pre-processing in order to create geometries in Geopandas.


Download the following CSV file: and save it into your working folder.

It is a dataset provided by the Information Technology and Development Center of the Ministry of the Interior of Estonia (SMIT). SMIT is the largest IT institution in the country, which creates and manages information systems necessary for saving lives and ensuring internal security. The data contains anonymised records of crimes, felonies and other offenses against the law. The data is binned into squares of either 500mx500m or 1000mx1000m. However, whenever we work with datasets we can expect issues and errors with the data and have to be able to get the valuable information out.

Create a Jupyter notebook and import some of our default libraries

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: import geopandas as gpd

In [4]: from shapely.geometry import box, Polygon


  • Problem 1: Load and parse CSV file to get useful columns

  • Problem 2: Polygons to map

  • Problem 3: Summarize damages per area

Problem 1: Parse CSV file to get useful columns

The main problem is that a) all fields are text/string fields and can’t be acted upon immediately. Usually we need numerical data for analytical and geographical purposes.

In order to read even the CSV into Pandas we need to employ some low level commands:

# read the CSV and strip whitespace
df = pd.read_table("avalik_3.csv").apply(lambda col: col.str.strip())

Now you have the dataframe to work with. Check things like ‘df.head(5)’, ‘df.columns’ and ‘df.dtypes’ to look at the data.

There are many interesting columns to analyse, like date and time of day of the offense, municipality, region or city, specific information about the type of offense, the damage caused and coordinate pairs. We will focus for now on the coordinate pairs Lest_X and Lest_Y as well as the damage caused Kahjusumma.

These fields are text/strings, and contain usually two numbers, linked with a dash/minus, e.g. ‘6589000-6589499’ or ‘0-499’. In Python you can split strings using the split-function.

# exemplary step to extract two coordinate values
In [5]: x_var = '6589000-6589499'

# we split based on the '-'
In [6]: split_list = x_var.split('-')

# split_list is a list, that holds all the elements that were split, it's length should be 2
In [7]: print(len(split_list))

# the content of the list is:
In [8]: print(split_list)
['6589000', '6589499']

# now we can access each element by index
In [9]: print(split_list[0])

In [10]: a = split_list[0]

In [11]: b = split_list[1]

Now these variables are technically still text/string objects. In order to extract and convert the number, which is maintained as a text object currently, we can try to cast into a numerical datatypes.

If we take a whole Pandas DataFrame column we can use the Pandas to_numeric function.

df['x_min'] = pd.to_numeric(df['x_min_text'], errors='coerce')
# or
df['x_min'] = df['x_min_text'].apply(pd.to_numeric, errors='coerce')

If you’d like to handle that on a per-element-level, you could do the cast/conversion in your function, using Python’s built-in cast functions int() or float():

# the elements from before ...
In [12]: print(type(split_list[0]))
<class 'str'>

In [13]: a = split_list[0]

In [14]: b = split_list[1]

In [15]: print(type(a))
<class 'str'>

# now using
In [16]: a_int = int(a)

In [17]: print(type(a_int))
<class 'int'>

Sometimes the data is not in a form or value that can be cast/converted in to numbers. In this case will get an Error, a so called Exception. In Python you can easily ask for forgiveness using a try: / except: block:

In [18]: good_text_int = "5"

In [19]: good_text_float = "5.5"

In [20]: not_a_number = "argl"

# all still text/string
In [21]: print(type(good_text_int))
<class 'str'>

In [22]: try:
   ....:     x1 = int(good_text_int)
   ....:     print(x1)
   ....:     print(type(x1))
   ....: except:
   ....:     print("error converting good_text_int")
<class 'int'>

# this should work
In [23]: try:
   ....:     x2 = float(good_text_float)
   ....:     print(x2)
   ....:     print(type(x2))
   ....: except:
   ....:     print("error converting good_text_float")
<class 'float'>

# this will fail
In [24]: try:
   ....:     x3 = float(not_a_number)
   ....:     print(x3)
   ....:     print(type(x3))
   ....: except:
   ....:     print("error converting not_a_number")
error converting not_a_number

Ideally, you will define some functions, and apply them one by one to create additional columns where you eventually hold the numerical values.

Problem 2: Polygons to map

Now you have the minimum and maximum values for X-coordinates, for Y-coordinates and for the damage values.

For the geographic coordinates, you need to build the polygons.

As there are meant as squares and are only given as bounding box values, we can immediately use Shapely’s box function.

df['geometry'] = df.apply( "... function that creates box from the available numerical coordinate values" )

Then you can create a proper GeoDataFrame. The coordinates were provided in the Estonian National Grid system, EPSG:3301.

Problem 3: Summarize damages per area

currently, you have more than 40000 records/polygons. However, as you might or might have not realized, there are many that have the same geometry, thus, offense reported in the same area.

Find a meaningful to summarize the average damages per square.