Point pattern analysis

Point pattern analysis is a set of statistical techniques used to study the spatial distribution of point events. It aims to determine if points are clustered, dispersed, or randomly distributed, and to identify the factors influencing their spatial arrangement. It’s used in diverse fields like ecology, epidemiology, criminology, and geography to analyze phenomena like disease outbreaks, crime hotspots, or species distributions.

Three main methods:

  1. Centrography
  2. Density-based metrics
  3. Distance-based metrics

We will use Tartu bike-share GPS data for the analysis. The bike data originates from the Tartu City Government, and the landscape fire data comes from the Estonian Rescue Board.

Download the ZIP file here.

import numpy as np
import pandas as pd
import geopandas as gpd
# import pysal -> typically you now import directly mapclassify or pointpats or esda ...
import libpysal as lp
import esda
import os
import seaborn as sns
import contextily
import matplotlib.pyplot as plt
from shapely.geometry import Point
from shapely.geometry import Polygon
from sklearn.cluster import DBSCAN
import cartopy.crs as ccrs
%matplotlib inline
points = pd.read_csv('rattaringluse_gps_valjavote_final.csv', encoding='utf8')
points.head(3)
cyclenumber time_stamp second unix_epoch X Y date time
0 2045 19/07/2019 07:02 0 1563519770 26.720922 58.380096 19/07/2019 07:02:50
1 2045 19/07/2019 07:41 0 1563522110 26.720424 58.380070 19/07/2019 07:41:50
2 2045 19/07/2019 08:11 0 1563523880 26.720045 58.380042 19/07/2019 08:11:20
len(points)
5379

We will check how many different bikes' points are in the dataset.

# Get unique cycle numbers
unique_cycles = points['cyclenumber'].unique().tolist()

# Get the count of unique cycle numbers
num_unique_cycles = len(unique_cycles)

# Print the results
print(f"Unique cycle numbers: {unique_cycles}")
print(f"Number of unique cycle numbers: {num_unique_cycles}")
Unique cycle numbers: [2045, 2056, 2058, 2064, 2068, 2070, 2090, 2093, 2098, 2103, 2104, 2119, 2124, 2126, 2128, 2145, 2154, 2166, 2169, 2171, 2172, 2174, 2178, 2185, 2199, 2207, 2209, 2211, 2220, 2227, 2229, 2232, 2233, 2239, 2246, 2259, 2260, 2267, 2275, 2306, 2321, 2324, 2329, 2344, 2352, 2356, 2365, 2380, 2394, 2396, 2417, 2423, 2428, 2432, 2433, 2446, 2448, 2454, 2458, 2468, 2478, 2485, 2494, 2495, 2504, 2516, 2522, 2524, 2526, 2528, 2545, 2564, 2566, 2580, 2586, 2602, 2608, 2609, 2613, 2616, 2620, 2625, 2641, 2645, 2646, 2655, 2662, 2665, 2672, 2684, 2687, 2711, 2735, 2740, 2742, 2749, 2752, 2784]
Number of unique cycle numbers: 98

Now, we can check the time period for the dataset. The time is in column unix_epoch. Unix time is a date and time representation widely used in computing. It measures time by the number of non-leap seconds that have elapsed since 00:00:00 UTC on 1 January 1970 (Wiki).
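
As a quick check, the Unix epoch values can be converted into human-readable timestamps with pandas (a minimal sketch, assuming the unix_epoch column shown above):

# Convert Unix epoch seconds into pandas datetimes (UTC)
timestamps = pd.to_datetime(points['unix_epoch'], unit='s', utc=True)
print(f"From {timestamps.min()} to {timestamps.max()}")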

# Calculate duration in hours
duration = (points['unix_epoch'].max() - points['unix_epoch'].min())/ (60 * 60)

# Print the result
print(f"Duration (hours): {duration}")
Duration (hours): 1.9944444444444445
# Make geometry column from X and Y columns
def create_geometry_from_xy(df, x_col='x', y_col='y', crs="EPSG:4326"):
    try:
        geometry = [Point(xy) for xy in zip(df[x_col], df[y_col])]
        gdf = gpd.GeoDataFrame(df, geometry=geometry, crs=crs)
        return gdf
    except KeyError as e:
        print(f"Error: Column '{e.args[0]}' not found in DataFrame.")
        return None
    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        return None
bike_points = create_geometry_from_xy(points, x_col='X', y_col='Y', crs="EPSG:4326")
bike_points.head(3)
cyclenumber time_stamp second unix_epoch X Y date time geometry
0 2045 19/07/2019 07:02 0 1563519770 26.720922 58.380096 19/07/2019 07:02:50 POINT (26.72092 58.3801)
1 2045 19/07/2019 07:41 0 1563522110 26.720424 58.380070 19/07/2019 07:41:50 POINT (26.72042 58.38007)
2 2045 19/07/2019 08:11 0 1563523880 26.720045 58.380042 19/07/2019 08:11:20 POINT (26.72005 58.38004)
bike_points.plot(marker='o', color="red", markersize=1.0)

# Let's plot the GPS points of each bike with a different color

unique_cycles = bike_points['cyclenumber'].unique()
norm = plt.Normalize(vmin=bike_points['cyclenumber'].min(), vmax=bike_points['cyclenumber'].max())

fig, ax = plt.subplots(1, 1)

for cycle in unique_cycles:
    cycle_points = bike_points[bike_points['cyclenumber'] == cycle]
    color = plt.cm.tab20(norm(cycle)) 
    cycle_points.plot(ax=ax, marker='o', color=color, markersize=1.0, label=f'Cycle {cycle}')
#ax.legend(title='cyclenumber')
plt.show()

Centrography

Centrography is a set of spatial statistical techniques used to describe the central tendency and dispersion of point patterns or geographic distributions. It provides the following quantitative measures:

Mean center: The average location of the points.

Median center: The point that minimizes the total distance to all points in the pattern.

Standard deviational ellipse: A representation of the spatial spread and orientation of the points.

We will first calculate the mean and median center of the GPS points.

from pointpats import centrography
from matplotlib.patches import Ellipse

#calculate spatial mean and median of the points
mean_center = centrography.mean_center(bike_points[["X", "Y"]])
med_center = centrography.euclidean_median(bike_points[["X", "Y"]])

print ("Mean center is", mean_center)
print ("Median center is", med_center)
Mean center is [26.72938143 58.37326066]
Median center is [26.72919961 58.37584394]
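
For intuition, the mean center is simply the column-wise average of the coordinates; a quick check with pandas (this should match the centrography result above):

# The mean center equals the average X and Y of all points
print(bike_points[["X", "Y"]].mean().values)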

In point pattern analysis, the standard deviational ellipse (SDE) provides a visual and statistical representation of the spatial dispersion and orientation of a set of points. The size of the ellipse indicates the overall spread of the point pattern: a larger ellipse signifies greater dispersion, while a smaller ellipse suggests a more concentrated pattern. The angle of the ellipse's major axis reveals the primary direction or orientation of the point distribution, showing the trend or elongation of the pattern.

# calculate standard ellipse
major, minor, rotation = centrography.ellipse(bike_points[["X", "Y"]])
# Create the base for the figure - subplots and size
fig, ax = plt.subplots(1, figsize=(10, 10))
ax.set_aspect('equal') # this is important to keep the proportions of the map correct!
# Plot bike points
ax.scatter(bike_points["X"], bike_points["Y"], s=0.2)
ax.scatter(*mean_center, color="red", marker="D", label="Mean center", zorder=2)
ax.scatter(
    *med_center, marker="*", color="purple", label="Median center", zorder=2
)
# Construct the standard ellipse using matplotlib
ellipse = Ellipse(
    xy=mean_center,  # center the ellipse to the mean center
    width=major * 2,  # centrography.ellipse returns semi-axis lengths, so we multiply by 2
    height=minor * 2,
    angle=np.rad2deg(rotation),  # Centrography gives angle in radians. Convert radians to degrees
    facecolor="darkred",
    edgecolor="darkred",
    linewidth=2,
    linestyle="-.",
    alpha=0.3,
    label="Standard ellipse",
    zorder=1,
)
ax.add_patch(ellipse)

ax.legend()

# Add basemap
contextily.add_basemap(
    ax, 
    crs="EPSG:4326",
    source=contextily.providers.CartoDB.Voyager
)
plt.show()

Convex hull is the smallest convex polygon that encloses all the points in a dataset. It effectively creates a “boundary” around the point distribution, showing the outermost extent of the pattern. It’s used to define and visualise the spatial extent but also provides a way to identify outliers. Points on the hull are the most extreme, potentially highlighting outliers.
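
As a side note, shapely can compute a convex hull directly from a geometry; a minimal sketch for the whole point set (assuming the bike_points GeoDataFrame created above):

from shapely.geometry import MultiPoint

# Convex hull of all bike GPS points as a single shapely polygon
all_points_hull = MultiPoint(list(zip(bike_points["X"], bike_points["Y"]))).convex_hull
print(all_points_hull.wkt[:80])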

Let’s compare two different bikes’ convex hulls. First we have to extract two different bikes from the initial dataframe. Let’s choose bikes number 2735 and 2119.

#Extract bike number 2735 data
bike_1 = bike_points[bike_points['cyclenumber'] == 2735]
coords_1 = bike_1[["X", "Y"]].values

#Extract bike number 2119 data
bike_2 = bike_points[bike_points['cyclenumber'] == 2119]
coords_2 = bike_2[["X", "Y"]].values

Next we need to calculate the coordinates of the convex hulls from the bikes’ coordinates. For that we can use the centrography.hull method.

#calculate convex hull coordinates for bike 2735
convex_hull_coords1 = centrography.hull(coords_1)

#calculate convex hull coordinates for bike 2119
convex_hull_coords2 = centrography.hull(coords_2)

An even more accurate way to represent the shape of a point cloud is the alpha shape. An alpha shape provides a way to create more detailed and potentially concave boundaries around a point cloud. An alpha shape can have concave parts, while a convex hull cannot. In fact, every convex hull is an alpha shape, but not every alpha shape is a convex hull.

#calculate alpha shapes for bikes 2735 and 2119
alpha_shape1 = lp.cg.alpha_shape_auto(coords_1)
alpha_shape2 = lp.cg.alpha_shape_auto(coords_2)

Let’s add the alpha shapes and convex hulls, together with the bikes’ GPS points, to the map.

fig, ax = plt.subplots(1, figsize=(10, 10))
ax.set_aspect('equal')
# Add bike 2735 gps points
ax.scatter(
    *coords_1.T, color="blue", marker=".", label="Bike 2735"
)

# Add bike 2119 gps points
ax.scatter(
    *coords_2.T, color="red", marker=".", label="Bike 2119"
)

# Add bike 2735 alpha shape
gpd.GeoSeries(
    [alpha_shape1]
).plot(
    ax=ax,
    edgecolor="blue",
    facecolor="blue",
    alpha=0.1,
    label="Alpha shape of bike 2735"
)


# Add bike 2735 convex hull
ax.add_patch(
    plt.Polygon(
        convex_hull_coords1,
        closed=True,
        edgecolor="blue",
        facecolor="none",
        linestyle="--",
        linewidth=1,
        label="Convex Hull"
    )
)

# Add bike 2119 alpha shape
gpd.GeoSeries(
    [alpha_shape2]
).plot(
    ax=ax,
    edgecolor="red",
    facecolor="red",
    alpha=0.1,
    label="Alpha shape of bike 2119",
)
# Add bike 2119 convex hull
ax.add_patch(
    plt.Polygon(
        convex_hull_coords2,
        closed=True,
        edgecolor="red",
        facecolor="none",
        linestyle="--",
        linewidth=1,
        label="Convex Hull2"
    )
)
# Add basemap
contextily.add_basemap(
    ax, source=contextily.providers.CartoDB.Voyager,
    crs="EPSG:4326"
)
plt.legend()
/tmp/ipykernel_426/3967103221.py:65: UserWarning: Legend does not support handles for PatchCollection instances.
See: https://matplotlib.org/stable/tutorials/intermediate/legend_guide.html#implementing-a-custom-legend-handler
  plt.legend()

Visually, we can see that these two convex hulls don’t overlap. To be sure, let’s check with shapely’s intersects.

#make shapely polygons from the convex hull coordinates
p1 = Polygon(convex_hull_coords1)
p2 = Polygon(convex_hull_coords2)

#test intersection
print(p1.intersects(p2))
False

Density-based metrics

Density-based techniques characterize the pattern in terms of its distribution over the study area. The two most commonly used density-based measures are quadrat density and kernel density.

Quadrat density

In quadrat analysis, the study area is divided into smaller sub-regions (i.e., quadrats), and the point density is computed for each sub-region by dividing the number of points in a quadrat by the quadrat’s area. Quadrats can take many different shapes, such as squares, hexagons, and triangles. Let’s first look at a quick numerical sketch with a rectangular grid and then create a hexagonal binning for the GPS points.
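
A minimal sketch of quadrat counting on a rectangular grid (the 10 x 10 grid size is an arbitrary choice):

# Count bike GPS points in a 10 x 10 grid of rectangular quadrats
counts, x_edges, y_edges = np.histogram2d(bike_points["X"], bike_points["Y"], bins=10)
# Density = count divided by quadrat area (in squared degrees, since the coordinates are in WGS84)
quadrat_area = (x_edges[1] - x_edges[0]) * (y_edges[1] - y_edges[0])
print((counts / quadrat_area).max())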

fig, ax = plt.subplots(1, figsize=(12,10))
ax.set_aspect('equal')
X = bike_points['X']
Y = bike_points['Y']

hb = ax.hexbin(X, 
               Y, 
               gridsize=30, #sets the size of the hexagons. Try to change it
               cmap='plasma', 
               alpha=0.5, 
               linewidths=0)

plt.colorbar(hb, label='Count of points')
plt.title('Bike gps point density in hexagons')
plt.xlabel('Longitude')
plt.ylabel('Latitude')


# Add basemap
contextily.add_basemap(
    ax, source=contextily.providers.OpenStreetMap.Mapnik,
    crs="EPSG:4326"
)

Kernel density

Kernel density uses a moving window, which is defined by a kernel. The kernel density approach generates a grid of density values whose cell size is smaller than the kernel window. Each cell is assigned the density value computed for the kernel window centered on that cell. Instead of binning data into discrete bins like hexagons, KDE (Kernel Density Estimation) plots create a smooth, continuous curve representing the data’s estimated probability density. Common kernels include the Gaussian (normal) kernel. A kernel is placed over each data point, and these kernels are summed to create the overall density estimate. KDE plots provide a smoother representation of the data’s distribution compared to quadrats, which can be affected by the choice of bin size. This smoothness can reveal underlying patterns that might be obscured in quadrats.
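
Under the hood, this is roughly what a 2D KDE does; a minimal sketch with scipy's gaussian_kde (the bandwidth is left at scipy's default, so the result may differ slightly from the seaborn plot below):

from scipy.stats import gaussian_kde

# Fit a 2D Gaussian KDE to the GPS coordinates (gaussian_kde expects an array of shape (2, n))
kde = gaussian_kde(np.vstack([bike_points["X"], bike_points["Y"]]))
# Evaluate the estimated density at the mean center as an example
print(kde([[bike_points["X"].mean()], [bike_points["Y"].mean()]]))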

Let’s create kernel density estimation for our bike points. For this, we will use seaborn.kdeplot.

fig, ax = plt.subplots(1, figsize=(12,10))
ax.set_aspect('equal')

# fig, ax = plt.subplots(1, figsize=(12,10))

sns.kdeplot(
    x=bike_points['X'], 
    y=bike_points['Y'], 
    fill=True, 
    cmap='viridis', 
    levels=20, #Number of contour levels or values to draw contours at. Try to change this value.
    alpha=0.7,
    ax=ax
)
plt.title('Kernel density of bike GPS points')
plt.xlabel('Longitude')
plt.ylabel('Latitude')

# Add basemap
contextily.add_basemap(
    ax, source=contextily.providers.CartoDB.Voyager,
    crs="EPSG:4326"
)

#Add additional basemap consisting only labels. As the KDE covers some relevant information like labels, we can add another layer consisting only of labels.
contextily.add_basemap(
    ax, source=contextily.providers.CartoDB.PositronOnlyLabels,
    crs="EPSG:4326"
)

DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that groups points that are closely packed together (points with many nearby neighbours), marking as outliers points that lie alone in low-density regions (whose nearest neighbours are too far away). This algorithm is good for data which contains clusters of similar density.

The two most essential parameters in DBSCAN are eps and min_samples.

eps (epsilon) determines how close points must be to each other to be considered “neighbours”. When DBSCAN examines a point, it checks all other points within a distance of eps from that point. eps essentially controls the “reach” or “spread” of clusters. A small eps value will lead to fewer points being considered neighbours, smaller and denser clusters, and more points being classified as noise (outliers).

min_samples specifies the minimum number of points that must be within a point’s eps neighbourhood for that point to be considered a core point. It sets a density threshold. A point must have at least min_samples neighbours within the eps radius to be considered part of a dense region. A small min_samples value will lead to more points being classified as core points and less dense clusters being formed.

DBSCAN identifies: 1) Core points - a point is classified as a “core point” if it has at least min_samples points within its eps neighbourhood. Core points form the core of the clusters. 2) Border points (non-core points) - data points that are within the eps neighbourhood of a core point but do not have min_samples points within their own eps neighbourhood. Border points are considered part of a cluster but are not as central as core points. 3) Noise points - a noise point is a data point that is neither a core nor a border point. Core points that lie within each other’s eps neighbourhoods are merged into the same cluster.
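
To make the three roles concrete, here is a tiny toy example (made-up coordinates, not part of the bike data) showing how core, border and noise points can be separated from a fitted model:

# Four points on a line: one core point, two border points, one noise point
toy = np.array([[0.0, 0.0], [0.0, 0.3], [0.0, 0.6], [0.0, 1.2]])
toy_db = DBSCAN(eps=0.5, min_samples=3).fit(toy)

core_mask = np.zeros_like(toy_db.labels_, dtype=bool)
core_mask[toy_db.core_sample_indices_] = True   # core points
noise_mask = toy_db.labels_ == -1                # noise points
border_mask = ~core_mask & ~noise_mask           # border points

print(toy_db.labels_)   # cluster id per point, -1 = noise
print(core_mask, border_mask, noise_mask)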

#data_points = bike_points.to_crs(epsg=3301)
#find clusters for the bike points
db = DBSCAN(eps=0.001, min_samples=30).fit(bike_points[["X", "Y"]]) # eps is in degrees here, since the coordinates are in WGS84. Try to change eps and min_samples values
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

print("Estimated number of clusters: %d" % n_clusters_)
print("Estimated number of noise points: %d" % n_noise_)
Estimated number of clusters: 19
Estimated number of noise points: 2896
bike_points
cyclenumber time_stamp second unix_epoch X Y date time geometry
0 2045 19/07/2019 07:02 0 1563519770 26.720922 58.380096 19/07/2019 07:02:50 POINT (26.72092 58.3801)
1 2045 19/07/2019 07:41 0 1563522110 26.720424 58.380070 19/07/2019 07:41:50 POINT (26.72042 58.38007)
2 2045 19/07/2019 08:11 0 1563523880 26.720045 58.380042 19/07/2019 08:11:20 POINT (26.72005 58.38004)
3 2045 19/07/2019 08:11 0 1563523890 26.719941 58.380027 19/07/2019 08:11:30 POINT (26.71994 58.38003)
4 2045 19/07/2019 08:11 0 1563523900 26.719004 58.379858 19/07/2019 08:11:40 POINT (26.719 58.37986)
... ... ... ... ... ... ... ... ... ...
5374 2784 19/07/2019 08:47 0 1563526050 26.721271 58.375757 19/07/2019 08:47:30 POINT (26.72127 58.37576)
5375 2784 19/07/2019 08:47 0 1563526060 26.721718 58.376008 19/07/2019 08:47:40 POINT (26.72172 58.37601)
5376 2784 19/07/2019 08:47 0 1563526070 26.722585 58.376385 19/07/2019 08:47:50 POINT (26.72259 58.37638)
5377 2784 19/07/2019 08:48 0 1563526090 26.724580 58.377187 19/07/2019 08:48:10 POINT (26.72458 58.37719)
5378 2784 19/07/2019 08:48 0 1563526100 26.725410 58.377520 19/07/2019 08:48:20 POINT (26.72541 58.37752)

5379 rows × 9 columns

We can see that there are 19 clusters and a lot of noise points. Let’s plot them.

fig, ax = plt.subplots(1, figsize=(12,7))
ax.set_aspect('equal')
unique_labels = set(labels)
core_samples_mask = np.zeros_like(labels, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True


colors = [plt.cm.Paired(each) for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = labels == k

    #plotting clusters
    xy = bike_points[class_member_mask & core_samples_mask]
    ax.plot(
        xy.loc[:, "X"],
        xy.loc[:, "Y"],
        "o",
        markerfacecolor=tuple(col),
        markeredgecolor="black",
        markersize=7,
        markeredgewidth=0.5,
        zorder=2
    )
    #plotting noise
    xy = bike_points[class_member_mask & ~core_samples_mask]
    ax.plot(
        xy.loc[:, "X"],
        xy.loc[:, "Y"],
        "o",
        markerfacecolor=tuple(col),
        markeredgecolor="black",
        markersize=2,
        markeredgewidth=0.5,
        zorder=1
    )


# Add basemap
contextily.add_basemap(
    ax, source=contextily.providers.CartoDB.Voyager,
    crs="EPSG:4326"
)

plt.title(f"Estimated number of bike GPS point clusters: {n_clusters_}")
Text(0.5, 1.0, 'Estimated number of bike GPS point clusters: 19')

Core samples (large dots) and non-core samples (small dots) are color-coded according to the assigned cluster. Samples tagged as noise are represented in black.

We can see that the bike data is not the easiest dataset for finding clusters, because the points are highly connected along the bike routes and no very clear clusters exist. Let’s try another dataset for DBSCAN clustering: we will use OpenStreetMap data to map the shops of Tartu and see if there is any clustering.

import osmnx as ox

# Get place boundary related to the place name as a geodataframe
place_name = 'Tartu'
area = ox.geocode_to_gdf(place_name)

area.plot()

# Extract shops from the OSM dataset
tags = {'shop': True}   

shops = ox.features_from_place('Tartu', tags)
shops.plot(color='orange', markersize=1)
shops.head(5)
geometry addr:city addr:country addr:housenumber addr:postcode addr:street check_date email name opening_hours ... fuel:lpg fuel:taxfree_diesel fuel:adblue shower old_addr:street communication:mobile_phone service layer air_conditioning toilets:wheelchair
element id
node 266921299 POINT (26.67666 58.35805) Tartu EE 75 50501 Ringtee 2024-08-11 info@lounakeskus.com Lõunakeskus Mo-Su 10:00-21:00 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
322906403 POINT (26.73024 58.38247) Tartu NaN 20 50603 Raatuse NaN NaN Raatuse Selver ABC Mo-Su 09:00-22:00 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
331179287 POINT (26.71956 58.36477) Tartu EE 55f NaN Võru 2024-07-02 NaN Maxima XX Mo-Su 08:00-22:00 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
354470712 POINT (26.71226 58.37624) Tartu NaN 20 50409 J. Kuperjanovi 2022-12-22 NaN Alko Feenoks Mo-Su 10:00-22:00 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
361134472 POINT (26.73618 58.37472) Tartu NaN 14 NaN Turu 2023-01-04 NaN Maxima XX Mo-Su 08:00-22:00 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 147 columns

This time we will transform the WGS84 data to the Estonian national CRS (EPSG:3301).

#reproject the shops data to Estonian crs
proj = shops.to_crs(epsg=3301)
proj.head(3)
geometry addr:city addr:country addr:housenumber addr:postcode addr:street check_date email name opening_hours ... fuel:lpg fuel:taxfree_diesel fuel:adblue shower old_addr:street communication:mobile_phone service layer air_conditioning toilets:wheelchair
element id
node 266921299 POINT (656645.839 6471739.365) Tartu EE 75 50501 Ringtee 2024-08-11 info@lounakeskus.com Lõunakeskus Mo-Su 10:00-21:00 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
322906403 POINT (659668.71 6474583.027) Tartu NaN 20 50603 Raatuse NaN NaN Raatuse Selver ABC Mo-Su 09:00-22:00 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
331179287 POINT (659124.663 6472588.706) Tartu EE 55f NaN Võru 2024-07-02 NaN Maxima XX Mo-Su 08:00-22:00 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

3 rows × 147 columns

# keep only point geometries and extract x and y coordinates
shop_points = proj[proj['geometry'].apply(lambda geom: isinstance(geom, Point))].copy()
shop_points['x'] = shop_points['geometry'].x
shop_points['y'] = shop_points['geometry'].y
shop_points.head(3)
geometry addr:city addr:country addr:housenumber addr:postcode addr:street check_date email name opening_hours ... fuel:adblue shower old_addr:street communication:mobile_phone service layer air_conditioning toilets:wheelchair x y
element id
node 266921299 POINT (656645.839 6471739.365) Tartu EE 75 50501 Ringtee 2024-08-11 info@lounakeskus.com Lõunakeskus Mo-Su 10:00-21:00 ... NaN NaN NaN NaN NaN NaN NaN NaN 656645.838896 6.471739e+06
322906403 POINT (659668.71 6474583.027) Tartu NaN 20 50603 Raatuse NaN NaN Raatuse Selver ABC Mo-Su 09:00-22:00 ... NaN NaN NaN NaN NaN NaN NaN NaN 659668.709600 6.474583e+06
331179287 POINT (659124.663 6472588.706) Tartu EE 55f NaN Võru 2024-07-02 NaN Maxima XX Mo-Su 08:00-22:00 ... NaN NaN NaN NaN NaN NaN NaN NaN 659124.662764 6.472589e+06

3 rows × 149 columns

As the projected coordinates are now in metres rather than degrees, eps also needs to be increased significantly (roughly, 0.001 degrees of latitude is about 111 m, whereas here eps is given directly in metres).

db = DBSCAN(eps=500, min_samples=10).fit(shop_points[["x", "y"]]) #Try to change eps and min_samples values
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

print("Estimated number of clusters: %d" % n_clusters_)
print("Estimated number of noise points: %d" % n_noise_)
Estimated number of clusters: 6
Estimated number of noise points: 47
fig, ax = plt.subplots(1, figsize=(12,7))
ax.set_aspect('equal')
unique_labels = set(labels)
core_samples_mask = np.zeros_like(labels, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True


colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = labels == k

    xy = shop_points[class_member_mask & core_samples_mask]
    ax.plot(
        xy.loc[:, "x"],
        xy.loc[:, "y"],
        "o",
        markerfacecolor=tuple(col),
        markeredgecolor="black",
        markersize=10,
        markeredgewidth=0.5
    )

    xy = shop_points[class_member_mask & ~core_samples_mask]
    ax.plot(
        xy.loc[:, "x"],
        xy.loc[:, "y"],
        "o",
        markerfacecolor=tuple(col),
        markeredgecolor="black",
        markersize=4,
        markeredgewidth=0.5
    )


# Add basemap
contextily.add_basemap(
    ax, source=contextily.providers.CartoDB.Voyager,
    crs="EPSG:3301"
)
plt.title(f"Estimated number of clusters: {n_clusters_}")
Text(0.5, 1.0, 'Estimated number of clusters: 6')

Distance-based metrics

In distance-based methods, the interest lies in how the points are distributed relative to one another, as opposed to how the points are distributed relative to the study extent. We will look into the following methods: nearest neighbour distance (NND), the G function, and hotspot analysis.

For distance based metrics, we will use the landscape fires dataset.

from pointpats import PointPattern 
from pointpats import distance_statistics

fires = gpd.read_file("est-wild_fires.shp")
fires.head(3)
sundmuse_n sundmuse_k tulekahju_ sundmuse_a mis_poles wgs_latitu wgs_longit maakond kov korgem_ast ... likvideeri viimase_re sundmuse_1 valjasoidu valjasoitn annulleeri kohal_ress x y geometry
0 2166474 2021-04-25 Metsa-ja maastikutulekahju Kulu Kogutud kuhilad väljaspool rajatisi 58.946 23.538 Lääne maakond Haapsalu linn PAASTE_1A ... None 2021-04-25 13:06:47 00:13:42 1 1.0 NaN 1.0 473406.543225 6.534192e+06 POINT (473406.543 6534192.248)
1 2166232 2021-04-25 Metsa-ja maastikutulekahju Mets/maastik None 59.395 27.281 Ida-Viru maakond Kohtla-Järve linn PAASTE_1 ... 2021-04-25 09:20:13 2021-04-25 09:30:26 02:09:11 2 2.0 NaN 2.0 686340.270306 6.588675e+06 POINT (686340.27 6588675.282)
2 2165241 2021-04-23 Metsa-ja maastikutulekahju Mets/maastik None 59.318 24.229 Harju maakond Lääne-Harju vald PAASTE_2 ... None 2021-04-23 22:31:33 00:33:50 8 6.0 6.0 2.0 513040.214259 6.575561e+06 POINT (513040.214 6575561.392)

3 rows × 25 columns

Let’s plot the landscape fires.

fig, ax = plt.subplots(1, figsize=(12,7))
ax.set_aspect('equal')

#ax.plot(color='orange', markersize=1)
ax.scatter(fires["x"], fires["y"], s=1, color="red")
# Add basemap
contextily.add_basemap(
    ax, source=contextily.providers.CartoDB.Voyager,
    crs="EPSG:3301"
)
plt.title("Landscape fires")
Text(0.5, 1.0, 'Landscape fires')

Nearest-neighbour distance (NND) is the distance between a point and its closest neighbouring point. NND is also known as the first-order nearest neighbour distance. In addition, the distance can be calculated for the kth nearest neighbour, which is called the kth-order NN or KNN. The mean NND over all points is used as a global indicator to measure the overall pattern of a point set.

To calculate NND, we first need to turn the fires’ x and y columns from the GeoDataFrame into a PointPattern object that pointpats can use.

pp = PointPattern(fires[["x", "y"]])

#calculate mean nearest neighbour of the whole study area
pp.mean_nnd
np.float64(3847.449077094534)

On average, each fire has its nearest neighbour within about 3847 m.
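
The same idea extends to the kth-order nearest neighbour; a minimal sketch using scipy's KD-tree (the choice of k = 3 here is arbitrary):

from scipy.spatial import cKDTree

coords = fires[["x", "y"]].values
tree = cKDTree(coords)
# Query k+1 neighbours because the nearest "neighbour" of each point is the point itself
dists, _ = tree.query(coords, k=4)
print("Mean 1st-order NND:", dists[:, 1].mean())   # should be close to pp.mean_nnd above
print("Mean 3rd-order NND:", dists[:, 3].mean())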

G-function

The G-function, also known as the nearest neighbour distance distribution function, is a spatial point pattern analysis tool that describes the distribution of distances from each point in a dataset to its nearest neighbour. It’s used to assess whether a point pattern is clustered, dispersed (regular), or random. The G-function calculates the cumulative distribution of nearest neighbour distances, which tells you the proportion of points that have a nearest neighbour within a certain distance. In other words, G(r) is the probability that a randomly chosen point has its nearest neighbour within a distance r.

By comparing the observed G-function to a theoretical G-function (under the assumption of complete spatial randomness, CSR), you can infer the spatial pattern of your points. Under CSR, the points are assumed to be randomly and independently distributed.

Clustered: If the observed G-function is above the expected G-function under CSR, it suggests that the points are clustered. This means that, on average, points are closer to their nearest neighbours than expected by chance.

Dispersed: If the observed G-function is below the expected G-function under CSR, it suggests that the points are dispersed or regularly spaced. This means that, on average, points are farther from their nearest neighbours than expected by chance.
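
To make the definition concrete, the empirical G can be computed directly from the first-order nearest-neighbour distances; a minimal sketch (the distance thresholds are an arbitrary choice):

from scipy.spatial import cKDTree

coords = fires[["x", "y"]].values
nnd = cKDTree(coords).query(coords, k=2)[0][:, 1]    # nearest-neighbour distance of each fire
support = np.linspace(0, 20000, 21)                  # distance thresholds in metres
g_empirical = [(nnd <= d).mean() for d in support]   # share of points with a NN within d
print(list(zip(support[:3], g_empirical[:3])))

The g_test function used below does essentially this for the observed points and, additionally, for many simulated CSR patterns, so the two can be compared.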

g_test = distance_statistics.g_test(
    (fires[["x", "y"]]), # the coordinates of the points for which nearest-neighbour distances are calculated
    keep_simulations=True  # keep the simulated CSR values so they can be plotted later
)

print(g_test)
GtestResult(support=array([    0.        ,  1845.7772965 ,  3691.554593  ,  5537.3318895 ,
        7383.109186  ,  9228.88648249, 11074.66377899, 12920.44107549,
       14766.21837199, 16611.99566849, 18457.77296499, 20303.55026149,
       22149.32755799, 23995.10485449, 25840.88215099, 27686.65944748,
       29532.43674398, 31378.21404048, 33223.99133698, 35069.76863348]), statistic=array([0.        , 0.50251256, 0.66834171, 0.76633166, 0.82663317,
       0.87688442, 0.89949749, 0.92713568, 0.94974874, 0.96733668,
       0.9798995 , 0.98743719, 0.98994975, 0.99246231, 0.99748744,
       0.99748744, 0.99748744, 0.99748744, 0.99748744, 1.        ]), pvalue=array([0.000e+00, 1.000e-04, 1.000e-04, 1.000e-04, 1.000e-04, 1.000e-04,
       2.680e-02, 2.831e-01, 1.720e-02, 2.000e-03, 9.000e-04, 9.000e-04,
       1.000e-04, 2.000e-04, 4.200e-03, 6.000e-04, 3.000e-04, 1.000e-04,
       0.000e+00, 0.000e+00]), simulations=array([[0.        , 0.08291457, 0.23869347, ..., 1.        , 1.        ,
        1.        ],
       [0.        , 0.06281407, 0.20351759, ..., 1.        , 1.        ,
        1.        ],
       [0.        , 0.04522613, 0.19346734, ..., 1.        , 1.        ,
        1.        ],
       ...,
       [0.        , 0.06030151, 0.19346734, ..., 1.        , 1.        ,
        1.        ],
       [0.        , 0.04522613, 0.20854271, ..., 1.        , 1.        ,
        1.        ],
       [0.        , 0.04522613, 0.18844221, ..., 1.        , 1.        ,
        1.        ]], shape=(9999, 20)))
fig, ax = plt.subplots(1, figsize=(10,7))

# plot the observed pattern's G function
ax.plot(
    g_test.support, 
    g_test.statistic, 
    label="observed", 
    color="red",
    zorder=3
)

# Plot all the simulations with very fine lines
ax.plot(
    g_test.support, 
    g_test.simulations.T, 
    color="lightgray", 
    alpha=0.01,
    zorder=2
)
# Plot the median of simulations
ax.plot(
    g_test.support,
    np.median(g_test.simulations, axis=0),
    color="black",
    label="median simulation",
    zorder=1
)

# Add labels to axes and figure title
ax.set_xlabel("Distance (m)")
ax.set_ylabel("Proportion of points that have a nearest neighbour \nwithin this distance")
ax.legend()
ax.set_xlim(0, 20000)
ax.set_title("Ripley's $G(d)$ function for landscape fires")

plt.show()

Our G-function analysis shows that the fire points are clustered up to a neighbourhood distance of roughly 12,000 m; beyond that they are rather dispersed. This can also be seen by visual inspection: the fires form clusters around cities, while further away there are large areas with few or no fires.

Hotspot analysis is a spatial statistical technique used to identify clusters of high or low values within a geographic dataset. The difference from other clustering methods is that it also takes into account the data attribute value, i.e. whether the attribute value is high or low, and clusters high and low values. It uses the Getis-Ord Gi* statistic to identify clusters: a z-score is calculated for each feature in the dataset, indicating whether it is part of a statistically significant hotspot or coldspot. Hotspots are areas where high values cluster, and coldspots are areas where low values cluster.

Hotspot analysis needs one attribute value, and we will use ‘valjasoitn’ for that purpose. ‘valjasoitn’ shows how many Rescue Board resources were actually deployed to the event, which indirectly indicates the magnitude of the event: the higher the number, the bigger the fire.
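
Before running the statistic, it can be useful to glance at the distribution of the attribute (a quick check):

# Summary statistics of the number of deployed resources per fire event
print(fires['valjasoitn'].describe())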

import esda
# lp is still libpysal from above

# Create a spatial weights matrix
w = lp.weights.DistanceBand.from_dataframe(fires, threshold=30000) # 30000 m is an example distance. Adjust as needed.

# Perform Getis-Ord Gi* analysis
gi = esda.G_Local(fires['valjasoitn'], w)

# Add Gi* results to the GeoDataFrame
fires['Gi_zscore'] = gi.z_sim
fires['Gi_pvalue'] = gi.p_sim

# Classify hotspots and coldspots and add them to the dataframe
fires['hotspot'] = ['hotspot' if z > 1.96 and p < 0.05 else 'coldspot' if z < -1.96 and p < 0.05 else 'not significant' for z, p in zip(fires['Gi_zscore'], fires['Gi_pvalue'])]

fires.head()
/usr/local/lib/python3.13/site-packages/libpysal/weights/util.py:826: UserWarning: The weights matrix is not fully connected: 
 There are 2 disconnected components.
 There is 1 island with id: 68.
  w = W(neighbors, weights, ids, **kwargs)
/usr/local/lib/python3.13/site-packages/libpysal/weights/distance.py:844: UserWarning: The weights matrix is not fully connected: 
 There are 2 disconnected components.
 There is 1 island with id: 68.
  W.__init__(
/usr/local/lib/python3.13/site-packages/esda/getisord.py:527: RuntimeWarning: invalid value encountered in divide
  z_scores = (statistic - expected_value) / np.sqrt(expected_variance)
('WARNING: ', 68, ' is an island (no neighbors)')
/usr/local/lib/python3.13/site-packages/esda/getisord.py:450: RuntimeWarning: invalid value encountered in divide
  self.z_sim = (self.Gs - self.EG_sim) / self.seG_sim
sundmuse_n sundmuse_k tulekahju_ sundmuse_a mis_poles wgs_latitu wgs_longit maakond kov korgem_ast ... valjasoidu valjasoitn annulleeri kohal_ress x y geometry Gi_zscore Gi_pvalue hotspot
0 2166474 2021-04-25 Metsa-ja maastikutulekahju Kulu Kogutud kuhilad väljaspool rajatisi 58.946 23.538 Lääne maakond Haapsalu linn PAASTE_1A ... 1 1.0 NaN 1.0 473406.543225 6.534192e+06 POINT (473406.543 6534192.248) 3.487929 0.002 hotspot
1 2166232 2021-04-25 Metsa-ja maastikutulekahju Mets/maastik None 59.395 27.281 Ida-Viru maakond Kohtla-Järve linn PAASTE_1 ... 2 2.0 NaN 2.0 686340.270306 6.588675e+06 POINT (686340.27 6588675.282) 0.678703 0.240 not significant
2 2165241 2021-04-23 Metsa-ja maastikutulekahju Mets/maastik None 59.318 24.229 Harju maakond Lääne-Harju vald PAASTE_2 ... 8 6.0 6.0 2.0 513040.214259 6.575561e+06 POINT (513040.214 6575561.392) -2.604398 0.004 coldspot
3 2164552 2021-04-22 Metsa-ja maastikutulekahju Kulu Kogutud kuhilad väljaspool rajatisi 58.881 25.572 Järva maakond Paide linn PAASTE_1 ... 3 3.0 1.0 2.0 590648.922000 6.527923e+06 POINT (590648.922 6527922.759) 0.394463 0.325 not significant
4 2164378 2021-04-22 Metsa-ja maastikutulekahju Kulu Kulu 59.360 26.986 Ida-Viru maakond Lüganuse vald PAASTE_1 ... 2 2.0 NaN 2.0 669771.246017 6.583997e+06 POINT (669771.246 6583997.357) 0.410972 0.325 not significant

5 rows × 28 columns

# Plot hotspots and coldspots
fig, ax = plt.subplots(1, 1, figsize=(10, 10))
ax.set_aspect('equal')

hotspots = fires[fires['hotspot'] == 'hotspot']
coldspots = fires[fires['hotspot'] == 'coldspot']
not_significant = fires[fires['hotspot'] == 'not significant']

hotspots.plot(ax=ax, color='red', marker='o', label='Hotspots', markersize=15, zorder=3)
coldspots.plot(ax=ax, color='blue', marker='o', label='Coldspots', markersize=15, zorder=2)
not_significant.plot(ax=ax, color='lightgray', marker='o', label='Not Significant', markersize=14, zorder=1)

ax.set_title(f"Hotspots and Coldspots of landscape fires")
ax.legend()

# Add basemap
contextily.add_basemap(
    ax, source=contextily.providers.CartoDB.Voyager,
    crs="EPSG:3301"
)

We see from the results that there are coldspots around Tartu and Tallinn, which means that small fire incidents aggregate there. At the same time, larger fire incidents occur further away from the cities, which makes sense as there is more fuel for a fire and it is more likely that a fire grows large before it is discovered.