From Historical OpenStreetMap data to customized training samples for geospatial machine learning
Hao Li and Zhaoyan Wu
OpenStreetMap (OSM) has recently shown great potential for providing massive, freely accessible training samples to empower geospatial machine learning. We developed a flexible framework that automatically generates customized training samples from historical OSM data and, at the same time, provides intrinsic OSM quality measurements as an additional feature. The framework also supports different satellite imagery APIs and machine learning tasks.
After more than a decade of rapid development, volunteered geographic information (VGI) has become one of the most important research topics in the GIScience community [1]. Over roughly the same period, geospatial machine learning has grown rapidly, both in intelligent GIServices [2] and in remote sensing tasks [3] such as land use and land cover classification, object detection, and change detection. Nevertheless, the lack of abundant training samples and of accurate semantic information has long been identified as a modelling bottleneck for such data-hungry machine learning applications. OpenStreetMap (OSM) shows great potential for tackling this bottleneck by providing massive and freely accessible geospatial training samples [4]. More importantly, OSM offers access to its full historical data [5], which can be analyzed to provide intrinsic data quality measurements for the training samples. A flexible framework for labeling customized geospatial objects using historical OSM data could therefore make machine learning more effective and efficient.
This work approaches the labeling of geospatial machine learning samples by providing a flexible framework for automatically generating customized training samples and measuring their intrinsic data quality. In more detail, we explored historical OSM data for two purposes: feature extraction and intrinsic quality assessment. For example, when training a convolutional neural network (CNN) for building detection, OSM features tagged building=residential or building=house are of obvious interest, while the data quality of those features may play an important role later in the CNN training phase. Therefore, besides acquiring the user-defined OSM features, we provide additional intrinsic quality measurements. Currently, we consider some basic statistics, such as the areas of buildings carrying different OSM tags, the number of distinct contributors in the last six months, or the equidistance of polygons tagged landuse=cropland. The last indicator is motivated by existing research suggesting that the lower the equidistance of a polygon, the better its relative quality, owing to further refinement and editing by users [6]. In the future, the framework could easily be extended with more sophisticated quality indicators for specific “fitness-for-use” purposes.
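The tag filtering and basic statistics described above can be sketched as follows. This is a minimal illustration only, not the framework's implementation: the record layout (`tags`, `ring`, `edits`), the function names, and the sample values are all hypothetical, and "equidistance" is read here as the mean distance between consecutive polygon vertices, which is an assumption. Coordinates are assumed to be planar (already projected).

```python
from datetime import datetime, timedelta
from math import hypot


def has_any_tag(tags, wanted):
    """Return True if the feature carries at least one wanted key=value pair."""
    return any(tags.get(key) == value for key, value in wanted)


def ring_area(ring):
    """Shoelace area of a closed ring given in planar (projected) coordinates."""
    twice_area = sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in zip(ring, ring[1:]))
    return abs(twice_area) / 2.0


def mean_vertex_spacing(ring):
    """Mean distance between consecutive vertices of a closed ring.

    Assumption: this is how 'equidistance' is interpreted in this sketch.
    """
    segments = [hypot(x2 - x1, y2 - y1) for (x1, y1), (x2, y2) in zip(ring, ring[1:])]
    return sum(segments) / len(segments)


def distinct_contributors(edits, now, days=182):
    """Count distinct users who edited within the last `days` days (about six months)."""
    cutoff = now - timedelta(days=days)
    return len({edit["user"] for edit in edits if edit["timestamp"] >= cutoff})


# Hypothetical feature record: a 2 x 2 square building with four recorded edits.
feature = {
    "tags": {"building": "house"},
    "ring": [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0), (0.0, 0.0)],
    "edits": [
        {"user": "alice", "timestamp": datetime(2020, 6, 1)},
        {"user": "bob", "timestamp": datetime(2020, 3, 15)},
        {"user": "alice", "timestamp": datetime(2019, 1, 1)},
        {"user": "carol", "timestamp": datetime(2019, 11, 1)},
    ],
}

now = datetime(2020, 7, 1)
wanted = [("building", "residential"), ("building", "house")]

stats = None
if has_any_tag(feature["tags"], wanted):
    stats = {
        "area": ring_area(feature["ring"]),
        "mean_vertex_spacing": mean_vertex_spacing(feature["ring"]),
        "recent_contributors": distinct_contributors(feature["edits"], now),
    }
```

In practice the edit history would come from an OSM history source rather than an in-memory list, and areas would need a suitable projection; both are outside the scope of this sketch.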
The framework supports heterogeneous remote sensing APIs. The user's options range from commercial satellite image providers (e.g., Bing or Mapbox) to government satellite missions (e.g., Sentinel-hub), and even user-defined tile map service (TMS) APIs. For the selected OSM features, the corresponding satellite imagery is automatically downloaded via TMS and tiled to the proper size. The framework also supports different machine learning tasks, such as classification, object detection, and semantic segmentation, each of which requires a distinct sample format. In a preliminary test, we extracted the geographical information of water dams tagged waterway=dam, which enables the training of a CNN for dam detection. Users can easily replace dams with other objects of interest, as long as the corresponding OSM tags are identified.
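To illustrate how a feature location can be mapped to an imagery tile, the sketch below uses the widely used Web Mercator tile indexing formula. It is an assumption that the framework addresses tiles this way; the function names and the URL template (which points to a placeholder host) are hypothetical. Note that the XYZ scheme counts rows from the top, whereas the TMS scheme counts them from the bottom, so the row index has to be flipped when a provider follows TMS.

```python
import math


def lonlat_to_tile(lon, lat, zoom):
    """Return the XYZ tile column and row containing a WGS84 point."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y


def xyz_to_tms_row(y, zoom):
    """Flip a top-origin XYZ row into a bottom-origin TMS row."""
    return 2 ** zoom - 1 - y


def tile_url(template, x, y, zoom):
    """Fill a tile URL template that uses {x}, {y} and {z} placeholders."""
    return template.format(x=x, y=y, z=zoom)


# Hypothetical template; a real provider defines its own URL pattern and access key.
TEMPLATE = "https://example.invalid/{z}/{x}/{y}.png"

col, row = lonlat_to_tile(0.0, 0.0, 1)
url = tile_url(TEMPLATE, col, row, 1)
tms_row = xyz_to_tms_row(row, 1)
```

A downloader would request one such tile per feature (or per neighbouring tile when a feature crosses a tile boundary) and then crop or resample it to the sample size required by the chosen task.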
This work aims to promote the application of geospatial machine learning by generating and assessing OSM training samples of user-specified objects. It not only allows users to train geospatial detection models, but also introduces intrinsic quality assessment into the “black box” of machine learning model training. Building on a deeper understanding of training sample quality, future efforts are needed towards more understandable and geographically aware machine learning models.
References
[1] Yan, Y., Feng, C., Huang, W., Fan, H., Wang, Y. & Zipf, A. (2020). Volunteered geographic information research in the first decade: a narrative review of selected journal articles in GIScience. International Journal of Geographical Information Science.
[2] Yue, P., Baumann, P., Bugbee, K. & Jiang, L. (2015). Towards intelligent GIServices. Earth Sci Inform 8, 463–481.
[3] Zhu, X., Tuia, D., Mou, L., Xia, G., Zhang, L., Xu, F. & Fraundorfer, F. (2017). Deep Learning in Remote Sensing: A Comprehensive Review and List of Resources. IEEE Geoscience and Remote Sensing Magazine, 5(4), 8–36.
[4] Li, H., Herfort, B. & Zipf, A. (2019). Estimating OpenStreetMap Missing Built-up Areas using Pre-trained Deep Neural Networks. Proceedings of the 22nd AGILE Conference, Cyprus.
[5] Raifer, M., Troilo, R., Kowatsch, F., Auer, M., Loos, L., Marx, S., Przybill, K., Fendrich, S., Mocnik, F.-B. & Zipf, A. (2019). OSHDB: a framework for spatio-temporal analysis of OpenStreetMap history data. Open Geospatial Data, Software and Standards 4, 3.
[6] Barron, C., Neis, P. & Zipf, A. (2014). A Comprehensive Framework for Intrinsic OpenStreetMap Quality Analysis. Transactions in GIS, 18(6), 877–895.