OpenStreetMap and Urban Data

../_images/OSM_logo.png

What is OpenStreetMap?

OpenStreetMap (OSM) is a global collaborative (crowd-sourced) dataset and project that aims at creating a free editable map of the world containing a lot of information about our environment. It contains data for example about streets, buildings, different services, and landuse to mention a few.

OSM has a large userbase with more than 4 million users that contribute actively on OSM by updating the OSM database with 3 million changesets per day. In total OSM contains more than 4 billion nodes that form the basis of the digitally mapped world that OSM provides (stats from November 2017.

OpenStreetMap is used not only for integrating the OSM maps as background maps to visualizations or online maps, but also for many other purposes such as routing, geocoding, education, and research. OSM is also widely used for humanitarian response e.g. in crisis areas (e.g. after natural disasters) and for fostering economic development (see more from Humanitarian OpenStreetMap Team (HOTOSM) website.

Download and visualize OpenStreetMap data with osmnx

This lesson we will explore a new and exciting Python module called osmnx that can be used to retrieve, construct, analyze, and visualize street networks from OpenStreetMap. In short it offers really handy functions to download data from OpenStreet map, analyze the properties of the OSM street networks, and conduct network routing based on walking, cycling or driving.

There is also a scientific article available describing the package:

As said, one the most useful features that osmnx provides is an easy-to-use way of retrieving OpenStreetMap data (using OverPass API ).

Let’s see how we can download and visualize street network data from a district of Tartu in City Centre, Estonia. Osmnx makes it really easy to do that as it allows you to specify an address to retrieve the OpenStreetMap data around that area. In fact, osmnx uses the same Nominatim Geocoding API to achieve this which we tested during the Lesson 2.

Let’s retrieve OpenStreetMap (OSM) data by specifying "Tartu, Estonia" as the address where the data should be downloaded.

In [1]: import osmnx as ox

In [2]: import matplotlib.pyplot as plt

In [3]: place_name = "Tartu, Estonia"

In [4]: graph = ox.graph_from_place(place_name)

In [5]: type(graph)
Out[5]: networkx.classes.multidigraph.MultiDiGraph

Okey, as we can see the data that we retrieved is a special data object called networkx.classes.multidigraph.MultiDiGraph. A DiGraph is a data type that stores nodes and edges with optional data, or attributes. What we can see here is that this data type belongs to a Python module called networkx that can be used to create, manipulate, and study the structure, dynamics, and functions of complex networks. Networkx module contains algorithms that can be used for example to calculate shortest paths along networks using e.g. Dijkstra’s or A* algorithm.

Let’s see how our street network looks like. It is easy to visualize the graph with osmnx with plot_graph() function. The function utilizes Matplotlib for visualizing the data, hence as a result it returns a matplotlib figure and axis objects.

In [6]: fig, ax = ox.plot_graph(graph)

In [7]: plt.tight_layout()
../_images/graph_basic.png

Great! Now we can see that our graph contains the nodes (circles) and the edges (lines) that connects those nodes to each other.

Let’s explore the street network a bit. The library allows us to convert the nodes and edges dataset into two separate geodataframes.

In [8]: nodes, edges = ox.graph_to_gdfs(graph)

In [9]: print(edges.head(3))
                                                   osmid  oneway lanes ref  \
u       v          key                                                       
8220933 6854135436 0                           117178126    True     5   3   
        8220941    0    [731962213, 730852870, 29394711]   False     4   3   
8220941 332221897  0                            28906993    True     2   3   

                                     name highway maxspeed   length  \
u       v          key                                                
8220933 6854135436 0                 Riia   trunk       50   65.343   
        8220941    0    [Narva mnt, Riia]   trunk       50  137.133   
8220941 332221897  0            Narva mnt   trunk       40   55.405   

                                                                 geometry  \
u       v          key                                                      
8220933 6854135436 0    LINESTRING (26.73016 58.37857, 26.72998 58.378...   
        8220941    0    LINESTRING (26.73016 58.37857, 26.73027 58.378...   
8220941 332221897  0    LINESTRING (26.73056 58.37979, 26.73061 58.379...   

                       bridge junction service tunnel access width  
u       v          key                                              
8220933 6854135436 0      NaN      NaN     NaN    NaN    NaN   NaN  
        8220941    0      yes      NaN     NaN    NaN    NaN   NaN  
8220941 332221897  0      NaN      NaN     NaN    NaN    NaN   NaN  
print(edges.crs)
print(edges.head(3))

As you will see, there are several interesting columns, for example:

  • ‘oneway’, ‘lanes’, ‘name’, or ‘maxspeed’

But like with all data we get from somewhere else, we need to check datatypes and potentially clean up the data. Let’s look at the speed limit maxspeed information that is entered in the OSM dataset for many streets. Here is again a very manual and defensive way of making sure to convert the information in the maxspeed column into a predictable numerical format.

In [10]: def clean_speed(x):
   ....:     try:
   ....:         if isinstance(x, str):
   ....:             return float(x)
   ....:         elif isinstance(x, float):
   ....:             return x
   ....:         elif isinstance(x, int):
   ....:             return x
   ....:         elif isinstance(x, list):
   ....:             return clean_speed(x[0])
   ....:         elif x is None:
   ....:             return 20
   ....:         elif np.isnan(x):
   ....:             return 20
   ....:         else:
   ....:             return x
   ....:     except:
   ....:         print(type(x))
   ....:         return 20
   ....: edges['speed'] = edges['maxspeed'].apply(clean_speed)
   ....: 

Now let’s have a more colourful look at the streets of Tartu, based on the estimated max-speed attribute. Let’s also reproject to Estonian National Grid (EPSG:3301):

In [11]: streets_3301 = edges.to_crs(3301)

In [12]: streets_3301.plot(column='speed', cmap='viridis', legend=True)
Out[12]: <AxesSubplot:>

In [13]: plt.tight_layout()
../_images/speed_streets.png

It is also possible to retrieve other types of OSM data features with osmnx.

Let’s download the buildings with geometries_from_place() function and plot them on top of our street network in Tartu.

# specify that we're retrieving building footprint geometries
In [14]: tags = {'building': True}

In [15]: gdf = ox.geometries_from_place('Kesklinn, Tartu, Estonia', tags)
In [16]: gdf.head(3)
Out[16]: 
                                        source  \
element_type osmid                               
node         2345996395                    NaN   
way          30221557    Tartu City Government   
             30221588    Tartu City Government   

                                                                  geometry  \
element_type osmid                                                           
node         2345996395                          POINT (26.71552 58.37320)   
way          30221557    POLYGON ((26.71092 58.38214, 26.71117 58.38227...   
             30221588    POLYGON ((26.71623 58.38370, 26.71638 58.38364...   

                        man_made amenity operator wheelchair building  \
element_type osmid                                                      
node         2345996395      NaN     NaN      NaN        NaN      yes   
way          30221557        NaN     NaN      NaN        NaN   school   
             30221588        NaN     NaN      NaN        NaN      yes   

                                          name height level  ... grades  \
element_type osmid                                           ...          
node         2345996395  TÜ Geograafia osakond    NaN   NaN  ...    NaN   
way          30221557             Joosepi maja    NaN   NaN  ...    NaN   
             30221588                      NaN    NaN   NaN  ...    NaN   

                        community_centre automated payment:coins self_service  \
element_type osmid                                                              
node         2345996395              NaN       NaN           NaN          NaN   
way          30221557                NaN       NaN           NaN          NaN   
             30221588                NaN       NaN           NaN          NaN   

                        fixme building:flats ways abandoned:amenity type  
element_type osmid                                                        
node         2345996395   NaN            NaN  NaN               NaN  NaN  
way          30221557     NaN            NaN  NaN               NaN  NaN  
             30221588     NaN            NaN  NaN               NaN  NaN  

[3 rows x 103 columns]

As a result we got the data as GeoDataFrames. Hence, we can plot them using the familiar plot() function of Geopandas. However, the data comes in WGS84. For a cartographically good representation lower scales that keeps shapes from being distorted, a Marcator based projection is ok. OSMnx provides a project function, that selects the local UTM Zone as CRS and reprojects for us.

# projects in local UTM zone
In [17]: gdf_proj = ox.project_gdf(gdf)

In [18]: gdf_proj.crs
Out[18]: 
<Projected CRS: +proj=utm +zone=35 +ellps=WGS84 +datum=WGS84 +unit ...>
Name: unknown
Axis Info [cartesian]:
- E[east]: Easting (metre)
- N[north]: Northing (metre)
Area of Use:
- undefined
Coordinate Operation:
- name: UTM zone 35N
- method: Transverse Mercator
Datum: World Geodetic System 1984
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

OSMnx provides an adhoc plotting function for geometry footprints, esp. useful for our building footprints:

In [19]: fig, ax = ox.plot_footprints(gdf_proj, dpi=400, show=True)

In [20]: plt.tight_layout()
../_images/building_footsprints1.png

Nice! Now, as we can see, we have our OSM data as GeoDataFrames and we can plot them using the same functions and tools as we have used before.

Note

There are also other ways of retrieving the data from OpenStreetMap with osmnx such as passing a Polygon to extract the data from that area, or passing a Point coordinates and retrieving data around that location with specific radius. Take a look at te developer tutorials to find out how to use those features of osmnx.

Let’s create a map out of the streets and buildings but again exclude the nodes (to keep the figure clearer).

In [21]: buildings_3301 = gdf_proj.to_crs(3301)

In [22]: plt.style.use('dark_background')

In [23]: fig, ax = plt.subplots(figsize=(21,17))

In [24]: streets_3301.plot(ax=ax, linewidth=2, alpha=0.9, column='speed', cmap='Greens_r')
Out[24]: <AxesSubplot:>

In [25]: buildings_3301.plot(ax=ax, lw=4, edgecolor='lightgreen', facecolor='white', alpha=0.8)
Out[25]: <AxesSubplot:>

In [26]: plt.tight_layout()
../_images/building_footsprints3.png

Cool! Now we have a map where we have plotted the buildings, streets and the boundaries of the selected region of ‘Tartu’ in City Centre. And all of this required only a few lines of code.

Note

Try following task yourself - Explore the column highway in our edges GeoDataFrame, it contains information about the type of the street (such as primacy, cycleway or footway). Select the streets that are walkable or that can be used with cycle and visualize only them with the buildings and the area polygon. Use different colors and line widths for the cycleways and footways.

Hint

There are a few nice and convenient high-level functions in osmnx that can be used to produce nice maps directly just by using a single function that might be useful. If you are interested take a look of this tutorial. In the lesson we won’t cover these, because we wanted to keep as much control to ourselves as possible, hence using lower-level functions.

More dynamic urban data

From OSM we loaded comparatively static geographic data. Now we want to throw in some spatio-temporal data into the mix.

In this lesson’s data package is an export of a dataset from the Tartu City Government’s open data portal. If you haven’t downloaded the data yet, you can download (and then extract) the dataset zip-package from this link.

Originally, the dataset can only be accessed via a webservice API. This is an excerpt for the weeks 2021-03-02 to 2021-04-06 and contains traffic counter data, in 15 minute intervals for several streets and intersections in central Tartu - it counts the amounts of cars passing through. At the end of the lesson is an example how to access the API live with Python. But the time being let’s use the exported data now.

In [27]: import geopandas as gpd

In [28]: traffic_data = gpd.read_file('source/_static/data/L4/tartu_export_5km_buffer_week.gpkg', driver="GPKG", layer="liiklus")
In [29]: traffic_data.head(3)
Out[29]: 
          name  count  lane  objectid           time            text_time  \
0  LKL_Tiksoja      1     2    296992  1616452200041  2021-03-22 22:30:00   
1    LKL_Roopa     17     1    296993  1616435100041  2021-03-22 17:45:00   
2  LKL_Tiksoja      3     2    296994  1616455800041  2021-03-22 23:30:00   

                    geometry  
0  POINT (26.65439 58.40841)  
1  POINT (26.69739 58.34843)  
2  POINT (26.65439 58.40841)  
traffic_data.info()

traffic_data.describe()

The time field is currently a text field. Pandas supports proper datetime indexing and querying, so we want to convert the current string object into a datetime datatype. The easiest is to delegate the hard work of guessing snd trying the right format to ‘pandas’:

In [30]: import pandas as pd

In [31]: traffic_data['datetime'] = traffic_data['text_time'].apply(pd.to_datetime, infer_datetime_format=True )

For the ‘time’ column you could guess that it might look like an epoch time stamp. In general, at some point it makes sense to get more familiar with Python’s datetime functions and Pandas datetime indexing and parsing support. Then you could try to parse it as a datetime timestamp, like so:

traffic_data['datetime'] = pd.to_datetime(traffic_data['time'], unit='ms')

For a moment, you might have considered to use the chronological timestamps as an index, like for a time-series for a weather station? But in this example we can’t do that because technically each location is its own separate time-series, and right now we don’t even know how many distinct locations (in terms of identical Point geometries) we have.

As we have little knowledge about the dataset, we should try to get a feeling of value ranges, categories, times etc. For example, let’s check how many different names there are. Maybe it gives us a clue as to what the spatial spread might be:

In [32]: stations = len(traffic_data['name'].unique())

In [33]: print(stations)
16

In [34]: print(traffic_data['name'].value_counts())
LKL_Voru            1885
LKL_Narva_mnt       1881
LKL_Ilmatsalu       1881
LKL_Riia_mnt_LV     1880
LKL_Roomu           1880
LKL_Arukula         1879
LKL_Viljandi_mnt    1876
LKL_Riia_mnt_LS     1876
LKL_Rapina_mnt      1874
LKL_Tiksoja         1873
LKL_Ringtee_3       1872
LKL_Lammi           1871
LKL_Ihaste          1871
LKL_Ringtee_2       1871
LKL_Ringtee_1       1868
LKL_Roopa           1862
Name: name, dtype: int64

Officially a handful of distinct names. something we could now explore further as one way of aggregation. Let’s look at the value distribution per “name” class. While the classic “Matplotlib” has the whole toolbox to plot whatever you desire or imagine, sometimes it is nice to use some higher level packages so we don’t have to do all the coding by hand. Let me introduce to you the Seaborn package. Seaborn builds on top of the Python statistical and scientific ecosystem, and of course on top of Matplotlib, and provides a large number of high-quality and practical statistical plots and works nicely with dataframes. In the following image we are using a BoxPlot.

In [35]: import seaborn as sns

In [36]: import matplotlib.pyplot as plt

In [37]: fig, ax = plt.subplots(figsize=(17,13))

In [38]: sns.boxplot(x='name', y='count', data=traffic_data)
Out[38]: <AxesSubplot:xlabel='name', ylabel='count'>

In [39]: plt.title(f"Boxplot of Traffic Count values for different location names", fontsize='x-large')
Out[39]: Text(0.5, 1.0, 'Boxplot of Traffic Count values for different location names')

In [40]: plt.xticks(rotation=45, fontsize='large')
Out[40]: 
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15]),
 [Text(0, 0, 'LKL_Tiksoja'),
  Text(1, 0, 'LKL_Roopa'),
  Text(2, 0, 'LKL_Lammi'),
  Text(3, 0, 'LKL_Viljandi_mnt'),
  Text(4, 0, 'LKL_Ringtee_3'),
  Text(5, 0, 'LKL_Riia_mnt_LS'),
  Text(6, 0, 'LKL_Ringtee_1'),
  Text(7, 0, 'LKL_Riia_mnt_LV'),
  Text(8, 0, 'LKL_Narva_mnt'),
  Text(9, 0, 'LKL_Roomu'),
  Text(10, 0, 'LKL_Ilmatsalu'),
  Text(11, 0, 'LKL_Ihaste'),
  Text(12, 0, 'LKL_Ringtee_2'),
  Text(13, 0, 'LKL_Arukula'),
  Text(14, 0, 'LKL_Voru'),
  Text(15, 0, 'LKL_Rapina_mnt')])

In [41]: plt.xlabel("location name", fontsize='x-large')
Out[41]: Text(0.5, 0, 'location name')

In [42]: plt.yticks(fontsize='large')
Out[42]: 
(array([-100.,    0.,  100.,  200.,  300.,  400.,  500.]),
 [Text(0, 0, ''),
  Text(0, 0, ''),
  Text(0, 0, ''),
  Text(0, 0, ''),
  Text(0, 0, ''),
  Text(0, 0, ''),
  Text(0, 0, '')])

In [43]: plt.ylabel("count", fontsize='x-large')
Out[43]: Text(0, 0.5, 'count')

In [44]: plt.tight_layout()
../_images/box_plots_name_all.png

It looks like the traffic counters at Narva street have a wider variation then most. Let’s explore only the counts for one “name” class for now. Later we can go back and apply the functions the whole dataset.

In [45]: narva_lk = traffic_data.loc[traffic_data['name'] == "LKL_Narva_mnt"].copy()

In [46]: fig, ax = plt.subplots(figsize=(17,13))

In [47]: sns.histplot(data=narva_lk, x="count")
Out[47]: <AxesSubplot:xlabel='count', ylabel='Count'>

In [48]: plt.title(f"Histogram of Traffic Count values for Narva Maantee location", fontsize='x-large')
Out[48]: Text(0.5, 1.0, 'Histogram of Traffic Count values for Narva Maantee location')

In [49]: plt.tight_layout()
../_images/narva_histplot1.png

The histogram above shows the general distribution of measurement values, obviously, in most cases there are none or only very few cars measured per time interval. Let’s make an aggregation for the hour of the day. Overall all days, take “sum()” for all counts that are during the same time of the day.

In [50]: narva_lk['hour_of_day'] = narva_lk['datetime'].apply(lambda dtx: dtx.hour)

In [51]: agg_narva_hours = []

In [52]: for hod, sub_df in narva_lk.groupby(by='hour_of_day'):
   ....:     all_counts = sub_df['count'].sum()
   ....:     agg_narva_hours.append( { 'hod': hod, 'count': all_counts}  )
   ....: 

In [53]: nava_hod_sum_df = pd.DataFrame(agg_narva_hours).set_index('hod')

In [54]: nava_hod_sum_df.plot(kind='bar')
Out[54]: <AxesSubplot:xlabel='hod'>

In [55]: plt.title(f"Bar plot of the sum of Traffic Count values for the Narva Maantee location, per hour of the day", fontsize=12)
Out[55]: Text(0.5, 1.0, 'Bar plot of the sum of Traffic Count values for the Narva Maantee location, per hour of the day')

In [56]: plt.tight_layout()
../_images/narva_histplot2.png

This is a bit easier, when you start getting more used to grouping and aggregating. You can write this now for the “average” (mean) in one line:

In [57]: narva_lk.groupby(by='hour_of_day')['count'].mean().plot(kind='bar')
Out[57]: <AxesSubplot:xlabel='hour_of_day'>

In [58]: plt.title(f"Bar plot of the mean of Traffic Count values for the Narva Maantee location, per hour of the day", fontsize=12)
Out[58]: Text(0.5, 1.0, 'Bar plot of the mean of Traffic Count values for the Narva Maantee location, per hour of the day')

In [59]: plt.tight_layout()
../_images/narva_histplot3.png

In the same way we should now also look at the weekday pattern.

In [60]: narva_lk['day_of_week'] = narva_lk['datetime'].apply(lambda dtx: dtx.day_of_week)

In [61]: fig, (ax_a, ax_b) = plt.subplots(1, 2, figsize=(18,15))

In [62]: narva_lk.groupby(by='day_of_week')['count'].sum().plot(kind='bar', ax=ax_a)
Out[62]: <AxesSubplot:xlabel='day_of_week'>

In [63]: narva_lk.groupby(by='day_of_week')['count'].mean().plot(kind='bar', ax=ax_b)
Out[63]: <AxesSubplot:xlabel='day_of_week'>

In [64]: plt.tight_layout()
../_images/narva_histplots4.png

Now we have drilled down into the data quite a bit. Now let’s figure out the spatial characteristics of the Narva station data.

In [65]: print(len(narva_lk['geometry'].unique() ))
6

Although they share the same station name, there are still a few different geometries in this data slice. However, we can’t aggregate directly on the Shapely geometry object type. But we know we can aggregate/groupby on strings. So now we use the WKT geometry text representation in order to group on those.

Shapely support loading and “dumping” geometries in WKT format.

In [66]: narva_lk['wkt'] = narva_lk['geometry'].apply(lambda g: g.wkt)

In [67]: narva_lk.groupby(by='wkt')['count'].mean().plot(kind='bar')
Out[67]: <AxesSubplot:xlabel='wkt'>

In [68]: plt.tight_layout()
../_images/narva_histplots_geom.png

Let’s now apply the steps on the whole dataset in order to make a map to visualise the traffic counts in Tartu on Tuesdays. At first we create our “day of week” column, and only subselect the observations for tuesdays (day=1). Secondly, we “destructure” our geometries to WKT for the spatial aggregation. We could potentially also use the station name, if we could agree that the single points are too close to each other to be meaningfully visualised separately.

In [69]: traffic_data['day_of_week'] = traffic_data['datetime'].apply(lambda dtx: dtx.day_of_week)

In [70]: tuesday_traffic = traffic_data.loc[traffic_data['day_of_week'] == 1].copy()

In [71]: tuesday_traffic['wkt'] = tuesday_traffic['geometry'].apply(lambda g: g.wkt)

Now we will groupby WKT (i.e. location) at first, then take the mean for all observations at a single point (the location), and the mean represent the average of counts on all Tuesdays. We rejoin them into a separate geodataframe which we need to reproject into EPSG:3301 (Estonia) to plotted together with our OSM urban data from above.

In [72]: from shapely import wkt

In [73]: agg_traffic_day = []

In [74]: for geom, sub_df in tuesday_traffic.groupby(by='wkt'):
   ....:     all_counts = sub_df['count'].mean()
   ....:     name = sub_df['name'].to_list()[0]
   ....:     agg_traffic_day.append( { 'name': name, 'count': all_counts, 'geometry': wkt.loads(geom)}  )
   ....: 

In [75]: tues_agg = pd.DataFrame(agg_traffic_day)

In [76]: tues_agg_3301 = gpd.GeoDataFrame(tues_agg, geometry='geometry', crs=4326).to_crs(3301)

Finally, we put the layers together in a single plot. The traffic counters seem to be distributed rather around the city than within.

In [77]: plt.style.use('dark_background')

In [78]: fig, ax = plt.subplots(figsize=(15,12))

In [79]: streets_3301.plot(ax=ax, linewidth=2, alpha=0.9, color="gray")
Out[79]: <AxesSubplot:>

In [80]: streets_3301.plot(ax=ax, linewidth=2, alpha=0.9, column='speed', cmap='Greens_r')
Out[80]: <AxesSubplot:>

In [81]: tues_agg_3301.plot(ax=ax, markersize=150, edgecolor='gray', lw=1, column="count", cmap="Reds", alpha=0.8, legend=True)
Out[81]: <AxesSubplot:>

In [82]: plt.title("Average traffic flow (number cars passing) on Tuesdays during March 2021")
Out[82]: Text(0.5, 1.0, 'Average traffic flow (number cars passing) on Tuesdays during March 2021')

In [83]: plt.tight_layout()
../_images/tartu_tuesday_traffic.png

Hint

This code block shows an example of how to access the data. Use with care, don’t overload the server with unbounded queries, such as too larger time spans for example. There are different query patterns possible as well, such as via envelope/bbox.

import requests
import datetime

query_url = "https://gis.tartulv.ee/arcgis/rest/services/Hosted/LiiklusLoendLaneBigDM/FeatureServer/0/query"

starttime = datetime.datetime(2021, 10, 4, 0, 0, 0)
endtime = datetime.datetime(2021, 10, 10, 23, 59, 59)
ts1 = int(starttime.timestamp()) * 1000
ts2 = int(endtime.timestamp()) * 1000

# available fields = "name":"LKL_Roomu","count":13,"globalid1":"{B253DBFB-8596-F347-7873-D9C433520079}","latitude":58.378952,"lane":1,"objectid":297054,"globalid":6.5485272E8,"longitude":26.777779,"time":1616432400041

params = {'f': 'geojson','returnGeometry': 'true', 'resultRecordCount': 40, 'where': "name='LKL_Roomu'", "time":"{t1}, {t2}", "outFields": "objectid,name,lane,count,time"}

req = requests.get(query_url, params=params)

import geopandas as gpd
gdata = req.json()

gdf = gpd.GeoDataFrame.from_features(gdata['features'])
gdf.crs = "EPSG:4326"