How to predict the number of emergency calls in different parts of the city using the geodata analysis?

Try to solve the problem from the online hackathon Geohack.112 . It is given: the territory of Moscow and Moscow region was divided into squares of sizes from 500 to 500 meters. The average number of emergency calls per day (numbers 11? 10? 10? 10? 10? 01? 02? 03? 040) is presented as the initial data. The region under consideration was divided into the western and eastern parts. Participants are invited, after studying in the western part, to predict the number of emergency calls for all the eastern squares.
How to predict the number of emergency calls in different parts of the city using the geodata analysis? Kendall's Tau-b , is considered as the proportion of pairs of objects with incorrectly ordered predictions, corrected for pairs with the same value of the target variable. The metric estimates the order in which the predictions are related to each other, not their exact values. Different days of the week are considered independent elements of the sample, i.e. the correlation coefficient is calculated by the predictions for all test pairs (zone_id, day of the week).
We read the table of squares with the help of Pandas [/b]
import pandas
df_zones = pandas.read_csv ('data /zones.csv', index_col = 'zone_id')
df_zones.head ()


Extracting characteristics from OpenStreetMap

The prediction of the number of calls can be carried out by methods of machine learning. To do this, each square needs to construct a vector with an indicative description. Having trained the model on the marked western part of Moscow, it can be used to predict the target variable in the eastern part.
According to the conditions of the competition, the data for solving the problem can be taken from any open sources. When it comes to describing a small area on the map, the first thing that comes to mind is OpenStreetMap, an open, non-commercial electronic map that is created by the community.
The OSM map is a set of elements three types:
Node: points on the map
Way: roads, squares, are set by a set of points
Relation: links between elements, for example, the combination of a road from several parts
Elements can have a set of tags - key-value pairs. Here is an example of a store that is listed on the map as an element of the Node type:

The current unloading of OpenStreetMap data can be taken from the site . Since we are interested in Moscow and its immediate surroundings, we will upload the file RU-MOS.osm.pbf - part of OpenStreetMap from the corresponding region in the binary format osm.pbf . To read such files from Python, there is a simple osmread library.
Before starting to work, we consider from OSM all elements of type Node from the area we need, which have tags (the rest we will cut off and will not use in the future).
The organizers of the competition prepared a baseline, which is available at . All the code that follows is contained in this baseline.
Loading objects from OpenStreetMap [/b]
import pickle
import osmread
from tqdm import tqdm_notebook
LAT_MIN, LAT_MAX = ???? ???r3r3510. LON_MIN, LON_MAX = ???? ???r3r3510.
osm_file = osmread.parse_file ('osm /RU-MOS.osm.pbf')
tagged_nodes =[
for entry in tqdm_notebook(osm_file, total=18976998)
if isinstance(entry, osmread.Node)
if len(entry.tags) > 0
if (LAT_MIN < < LAT_MAX) and (LON_MIN < entry.lon < LON_MAX)
# Save the list with the selected objects in a separate file
with open ('osm /tagged_nodes.pickle', 'wb') as fout:
pickle.dump (tagged_nodes, fout, protocol = pickle.HIGHEST_PROTOCOL)
# A file with saved objects can be quickly downloaded to
with open ('osm /tagged_nodes.pickle', 'rb') as fin:
tagged_nodes = pickle.load (fin)

Working in Python, the geodata can be quickly visualized on an interactive map using the library folium , which under the hood is Leaflet.js - the standard solution for displaying OpenStreetMap.
Example of visualization with folium [/b]
import folium
fmap = folium.Map ([55.753722, 37.620657])
# We will put the railway station
for node in tagged_nodes:
if node.tags.get ('railway') == 'station':
folium.CircleMarker ([, node.lon], radius = 3) .add_to (fmap)
# select the squares with the largest number of calls
calls_thresh = df_zones.calls_daily.quantile (.99)
for _, row in df_zones.query ('calls_daily> @calls_thresh'). iterrows ():
folium.features.RectangleMarker (
.bounds = ((row.lat_bl, row.lon_bl), (row.lat_tr, row.lon_tr)),
fill_color = 'red',
) .add_to (fmap)
# The map can be saved and viewed in the browser ('map_demo.html')

We aggregate the resulting set of points by squares and construct simple signs:
1. The distance from the center of the square to the Kremlin is
2. The number of points in the radius R from the center of the square (for different values ​​of R)
a. all points with tags
b. railway stations
c. shops
d. public transport stops
3. The maximum and average distance from the center of the square to the nearest points of the types listed above
To quickly find the points that are in proximity to the center of the square, we will use data structure of the k-d tree , implemented in the package SciKit-Learn in the class NearestNeighbors.
Construction of a table with an indicative description of the squares [/b]
import collections
import math
import numpy as np
from sklearn.neighbors import NearestNeighbors
kremlin_lat, kremlin_lon = ???? ???r3r3510.
def dist_calc (lat? lon? lat? lon2):
R = ???r3r3510.
lat1 = math.radians (lat1)
lon1 = math.radians (lon1)
lat2 = math.radians (lat2)
lon2 = math.radians (lon2)
dlon = lon2 - lon1
dlat = lat2 - lat1
a = math.sin (dlat /2) ** 2 + math.cos (lat1) * math.cos (lat2) *
math.sin (dlon /2) ** 2
c = 2 * math.atan2 (math.sqrt (a), math.sqrt (1 - a))
return R * c
df_features = collections.OrderedDict ([])
df_features['distance_to_kremlin']= df_zones.apply (
lambda row: dist_calc (row.lat_c, row.lon_c, kremlin_lat, kremlin_lon), axis = 1)
# set of point filters, according to which statistics
will be considered. POINT_FEATURE_FILTERS =[
('tagged', lambda
node: len(node.tags) > 0),
('railway', lambda node: node.tags.get('railway') == 'station'),
('shop', lambda node: 'shop' in node.tags),
('public_transport', lambda node: 'public_transport' in node.tags),
# centers of squares in the form of a matrix
X_zone_centers = df_zones[['lat_c', 'lon_c']].as_matrix ()
for prefix, point_filter in POINT_FEATURE_FILTERS:
# take a subset of points in accordance with the filter
coords = np.array ([
[, node.lon].
, for node in tagged_nodes
, if point_filter (node)
# build a data structure for fast search of points
neighbors = NearestNeighbors (). fit (coords)
# sign of the form "number of points in radius R from the center of the square"
for radius in[0.001, 0.003, 0.005, 0.007, 0.01]:
dists, inds = neighbors.radius_neighbors (X = X_zone_centers, radius = radius)
df_features['{}_points_in_{}'.format(prefix, radius)]=
np.array ([len(x) for x in inds])
# sign of the form "distance to the nearest K points"
for n_neighbors in[3, 5, 10]:
dists, inds = neighbors.kneighbors (X = X_zone_centers, n_neighbors = n_neighbors)
df_features['{}_mean_dist_k_{}'.format(prefix, n_neighbors)]=
dists.mean (axis = 1)
df_features['{}_max_dist_k_{}'.format(prefix, n_neighbors)]=
dists.max (axis = 1)
df_features['{}_std_dist_k_{}'.format(prefix, n_neighbors)]=
dists.std (axis = 1)
# sign of the form "distance to the nearest point"
df_features['{}_min'.format(prefix)]= dists.min (axis = 1)
# Create a table and save it in features.csv
df_features = pandas.DataFrame (df_features, index = df_zones.index)
df_features.to_csv ('data /features.csv')
df_features.head ()

As a result, for each square of the training and test sample, we have a feature description that can be used for prediction. A small modification of the proposed code can be taken into account in the attributes of objects of other types. Participants can add their tags, thereby giving the model more information about the urban development.
Predicting the number of calls
Experienced participants in data analysis competitions know that to get a high place it is important not only to look at the public leaderboard, but also to do their own validation on the training sample. Let's make a simple partition of the training part of the sample into subsamples for training and validation in the ratio 70/30.
Take only the target squares and train the random forest model (RandomForestRegressor) to predict the average number of calls per day.
Allocation of the validation subsample and training RandomForest [/b]
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
df_zones_train = df_zones.query ('is_test == 0 & is_target == 1')
idx_train, idx_valid = train_test_split (df_zones_train.index, test_size = 0.3)
X_train = df_features.loc[idx_train, :]
y_train = df_zones.loc[idx_train, 'calls_daily']
model = RandomForestRegressor (n_estimators = 10? n_jobs = 4) (X_train, y_train)

We will evaluate the quality on the validation subsample, for all the days of the week we will build the same forecast. The quality metric is a non-parametric correlation coefficient of Kendall tau-b, it is implemented in package SciPy as a function of scipy.stats.kendalltau.
It turns out the Validation score: ???r3r3499.  
This is not bad, because the value of the metric 0 means no correlation, and 1 - the total monotonic correspondence of the real values ​​and the predicted ones.
Forecast and quality assessment on validation [/b]
from scipy.stats import kendalltau
X_valid = df_features.loc[idx_valid, :]
y_valid = df_zones.loc[idx_valid, 'calls_daily']
y_pred = model.predict (X_valid)
target_columns =['calls_wd{}'.format(d) for d in range(7)]
df_valid_target = df_zones.loc[idx_valid, target_columns]
df_valid_predictions = pandas.DataFrame (collections.OrderedDict ([
(column_name, y_pred)
for column_name in target_columns
]), index = idx_valid)
df_comparison = pandas.DataFrame ({
'target': df_valid_target.unstack (),
'prediction': df_valid_predictions.unstack (),
valid_score = kendalltau (df_comparison['target'], df_comparison['prediction']) .correlation
print ('Validation score:', valid_score)

Before joining the ranks of participants who will receive an invitation to the Data Fest, there is one small step: to build a table with predictions on all test squares and send them to the system.
Construction of predictions on test [/b]
idx_test = df_zones.query ('is_test == 1'). index
X_test = df_features.loc[idx_test, :]
y_pred = model.predict (X_test)
df_test_predictions = pandas.DataFrame (collections.OrderedDict ([
(column_name, y_pred)
for column_name in target_columns
]), index = idx_test)
df_test_predictions.to_csv ('data /sample_submission.csv')
df_test_predictions.head ()

Why analyze your location?
The main sources of data, not only spatial (geographical), but also temporary, for any telecom-company are the complexes of equipment for receiving and transmitting the signal located throughout the territory of the service countries - Base Stations (BS). Spatial data analysis is more difficult technically, but can yield significant benefits and features that add weight to the effectiveness of machine learning models.
In the company MTS the use of geographic data helps to expand the knowledge base. They can be used to improve the quality of the service provided in analyzing customer complaints about the quality of communication, improving the coverage area of ​​the communication signal, planning the development of the network, and solving other problems where the space-time connection is an important factor.
Increasing population density, especially in large cities, the rapid development of infrastructure and road network, space images and vector maps, the number of buildings and POI (points of interest) from open maps of the area associated with the provision of services - all these spatial data can be obtained from external sources. Such data can be used as an additional source of information, and will also provide an objective picture regardless of network coverage.
Do you like the task? Then we invite you to take part in the online hackathon on the analysis of geographical data Geohack.112 . Register and upload your solutions to site until 24 April. The authors of the three best results will receive cash prizes. Individual nominations are provided for participants who presented the best public solution of the problem and better visualization of data journalism. The total prize fund of MTS GeoHack is ??? rubles . We hope to see new interesting approaches to the generation of spatial features, the visualization of geodata and the use of new open sources of information.
The winners will be awarded on April 28 at the conference DataFest .
+ 0 -

Add comment