When dealing with geographical datasets it is sometimes useful to get a visual representation of the data's spatial distribution (or density). Thankfully this is quite simple and efficient to achieve in Python with a Matplotlib histogram. To illustrate this we'll use the dataset provided in Kaggle's ECML/PKDD 15: Taxi Trajectory Prediction competition. This dataset contains millions of GPS coordinates recorded for all taxi trips taken in Porto, Portugal, over a complete year.
First, let's import all the libraries we need:
import copy
import numpy as np
from matplotlib import pyplot as plt
from matplotlib.colors import LogNorm
from code.data import load_data
%matplotlib inline
Now let's load the data:
data = load_data()
(Note: load_data
is just a custom function to load data from the CSV files provided for the competition. The code for that function isn't really relevant for the purposes of this blog post, but you may find it on Github if you're interested.)
The POLYLINE_FULL
column contains all GPS coordinates — in the format (latitude, longitude)
— for each trip. Here's a small sample for the first few trips:
data.train['POLYLINE_FULL'].head()
553433 [(41.14575, -8.610849), (41.145912, -8.610903)... 164647 [(41.146137, -8.617509), (41.137425, -8.614656... 439067 [(41.146164, -8.617554), (41.145696, -8.617518... 619178 [(41.165019, -8.628219), (41.164767, -8.628111... 1629758 [(41.154444, -8.649792), (41.154372, -8.649783... Name: POLYLINE_FULL, dtype: object
See a full list of coordinates for a given trip:
print data.train['POLYLINE_FULL'].iloc[0]
[(41.14575, -8.610849), (41.145912, -8.610903), (41.146047, -8.610786), (41.146623, -8.61003), (41.147073, -8.608977), (41.147559, -8.60868), (41.148396, -8.60823), (41.149134, -8.607744), (41.149908, -8.60724), (41.150556, -8.606961), (41.150637, -8.606916), (41.151069, -8.60661), (41.152401, -8.605818), (41.153454, -8.605521), (41.153526, -8.607528), (41.153544, -8.609733), (41.153625, -8.611785), (41.154831, -8.611983), (41.155164, -8.613063), (41.155164, -8.613081), (41.155371, -8.613594), (41.155794, -8.615727), (41.156028, -8.617437), (41.156073, -8.617824), (41.156433, -8.61993), (41.156775, -8.621694), (41.157126, -8.623836), (41.157549, -8.626338), (41.157954, -8.627985), (41.15871, -8.628786), (41.158728, -8.629578), (41.158737, -8.629821), (41.158611, -8.6301), (41.15862, -8.630145), (41.158611, -8.630118), (41.158593, -8.630118)]
Now let's merge all GPS coordinates for all trips in a flat list:
all_coordinates = np.concatenate(data.train['POLYLINE_FULL'].as_matrix())
We now have a list of about 80 millions coordinates:
print all_coordinates.shape
print all_coordinates
(78356174, 2) [[ 41.14575 -8.610849] [ 41.145912 -8.610903] [ 41.146047 -8.610786] ..., [ 41.156073 -8.627787] [ 41.156073 -8.627787] [ 41.156082 -8.627769]]
Here is now the code of our density map function:
def density_map(latitudes, longitudes, center, bins=1000, radius=0.1):
cmap = copy.copy(plt.cm.jet)
cmap.set_bad((0,0,0)) # Fill background with black
# Center the map around the provided center coordinates
histogram_range = [
[center[1] - radius, center[1] + radius],
[center[0] - radius, center[0] + radius]
]
fig = plt.figure(figsize=(5,5))
plt.hist2d(longitudes, latitudes, bins=bins, norm=LogNorm(),
cmap=cmap, range=histogram_range)
# Remove all axes and annotations to keep the map clean and simple
plt.grid('off')
plt.axis('off')
fig.axes[0].get_xaxis().set_visible(False)
fig.axes[0].get_yaxis().set_visible(False)
plt.tight_layout()
plt.show()
LogNorm
normalizes a given value to the 0-1 range on a logarithmic scale. Note that LogNorm
flags 0-value pixels — i.e. areas not covered by the provided GPS coordinates — as "bad" and will color them in white by default. I think the contrasts look better with a dark background, so here we tweak the color map to color those pixels in black instead. Finally, the bins
parameter allows you control the contrast of colors for the GPS points themselves.
Now let's draw the density map for all GPS coordinates and center it around Porto:
# Coordinates of Porto's city center
porto = [41.1579, -8.6291]
# Separate the latitude and longitude values from our list of coordinates
latitudes = all_coordinates[:,0]
longitudes = all_coordinates[:,1]
# Render the map
density_map(latitudes, longitudes, center=porto)
This rendering process is pretty interesting as it clearly highligthts Porto's busiest roads, the airport in the North West corner, as well as the coast line and the river estuary contours.