Independent Exercise
====================

This is an independent exercise for you to practice working with Python, DataFrames and geometry data. Results will be discussed in class but are not scored.
In this exercise you need to do some data cleaning and pre-processing in order to create geometries in Geopandas.

Data
----

Download the following CSV file: `https://opendata.smit.ee/ppa/csv/avalik_3.csv <https://opendata.smit.ee/ppa/csv/avalik_3.csv>`_ and save it into your working folder.

It is a dataset provided by the Information Technology and Development Center of the Ministry of the Interior of Estonia  (SMIT).  SMIT is the largest IT institution in the country, which creates and manages information systems necessary for saving lives and ensuring internal security. The data contains anonymised records of crimes, felonies and other offenses against the law.
The data is binned into squares of either 500mx500m or 1000mx1000m. However, whenever we work with datasets we can expect issues and errors with the data and have to be able to get the valuable information out.

Create a Jupyter notebook and import some of our default libraries

.. ipython:: python

    import numpy as np
    import pandas as pd
    import geopandas as gpd
    from shapely.geometry import box, Polygon



Sections
--------

 - Problem 1: Load and parse CSV file to get useful columns
 - Problem 2: Polygons to map
 - Problem 3: Summarize damages per area

Problem 1: Parse CSV file to get useful columns
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The main problem is that a) all fields are text/string fields and can't be acted upon immediately. Usually we need numerical data for analytical and geographical purposes.

In order to read even the CSV into Pandas we need to employ some low level commands:

.. code-block:: python

    # read the CSV and strip whitespace
    df = pd.read_table("avalik_3.csv").apply(lambda col: col.str.strip())


Now you have the dataframe to work with. Check things like 'df.head(5)', 'df.columns' and 'df.dtypes' to look at the data.

There are many interesting columns to analyse, like date and time of day of the offense, municipality, region or city, specific information about the type of offense, the damage caused and coordinate pairs. We will focus for now on the coordinate pairs **Lest_X** and **Lest_Y** as well as the damage caused **Kahjusumma**.

These fields are text/strings, and contain usually two numbers, linked with a dash/minus, e.g. '6589000-6589499' or '0-499'. In Python you can split strings using the `split-function <https://docs.python.org/3/library/stdtypes.html#str.split>`_.

.. ipython:: python

    # exemplary step to extract two coordinate values
    x_var = '6589000-6589499'
    # we split based on the '-'
    split_list = x_var.split('-')
    # split_list is a list, that holds all the elements that were split, it's length should be 2
    print(len(split_list))
    # the content of the list is:
    print(split_list)
    # now we can access each element by index
    print(split_list[0])

    a = split_list[0]
    b = split_list[1]


Now these variables are technically still text/string objects. In order to extract and convert the number, which is maintained as a text object currently, we can try to cast into a numerical datatypes.

If we take a whole Pandas DataFrame column we can use the `Pandas to_numeric <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_numeric.html>`_ function.

.. code:: python

    df['x_min'] = pd.to_numeric(df['x_min_text'], errors='coerce')
    # or
    df['x_min'] = df['x_min_text'].apply(pd.to_numeric, errors='coerce')


If you'd like to handle that on a per-element-level, you could do the cast/conversion in your function, using Python's built-in cast functions `int() <https://docs.python.org/3/library/functions.html#int>`_ or `float() <https://docs.python.org/3/library/functions.html>`_:

.. ipython:: python

    # the elements from before ...
    print(type(split_list[0]))
    a = split_list[0]
    b = split_list[1]
    print(type(a))
    # now using
    a_int = int(a)
    print(type(a_int))


Sometimes the data is not in a form or value that can be cast/converted in to numbers. In this case will get an Error, a so called Exception. In Python you can easily ask for forgiveness using a try: / except: block:

.. ipython:: python

    good_text_int = "5"
    good_text_float = "5.5"
    not_a_number = "argl"
    # all still text/string
    print(type(good_text_int))
    try:
        x1 = int(good_text_int)
        print(x1)
        print(type(x1))
    except:
        print("error converting good_text_int")
    # this should work
    try:
        x2 = float(good_text_float)
        print(x2)
        print(type(x2))
    except:
        print("error converting good_text_float")
    # this will fail
    try:
        x3 = float(not_a_number)
        print(x3)
        print(type(x3))
    except:
        print("error converting not_a_number")


Ideally, you will define some functions, and apply them one by one to create additional columns where you eventually hold the numerical values.

Problem 2: Polygons to map
~~~~~~~~~~~~~~~~~~~~~~~~~~

Now you have the minimum and maximum values for X-coordinates, for Y-coordinates and for the damage values.

For the geographic coordinates, you need to build the polygons.

As there are meant as squares and are only given as bounding box values, we can immediately use `Shapely's box function <https://shapely.readthedocs.io/en/stable/manual.html#shapely.geometry.box>`_.

.. code:: python

    df['geometry'] = df.apply( "... function that creates box from the available numerical coordinate values" )

Then you can create a proper GeoDataFrame. The coordinates were provided in the Estonian National Grid system, **EPSG:3301**.

Problem 3: Summarize damages per area
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

currently, you have more than 40000 records/polygons. However, as you might or might have not realized, there are many that have the same geometry, thus, offense reported in the same area.

Find a meaningful to summarize the average damages per square.