Thursday, March 15, 2012


Popularity analysis for merging geotagged data

Tanel Tammet

The new site http://www.sightsmap.com integrates wikipedia, wikitravel, foursquare and panoramio on a popularity heatmap.

We do not use any "professional" datasets: all the information shown on the maps (except the underlying google maps, of course) is crowdsourced by panoramio, wikipedia, wikitravel and foursquare users.
The key technology in achieving the right merging and presentation of crowdsourced data is combined popularity analysis of locations and wikipedia articles.

First we created a visual popularity heatmap of the whole world, using both the number of photos and the number of distinct photographers in the panoramio database for each area, working from a coarse-grained whole-world grid down through six successive layers of ever smaller grids with ever higher resolutions. The data comes from the panoramio public api. The colour of each pixel on the heatmap is calculated by a square-root-like function of the photos-and-photographers count of the corresponding grid square, with slight modifications to the function for each grid layer.
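As a rough illustration of the pixel colouring described above, here is a minimal Python sketch. The function name, the weighting of photographers against photos and the per-layer scale factor are all assumptions made for the example; the actual function and its per-layer modifications are not given here.

```python
import math

def heat_intensity(photos: int, photographers: int, layer_scale: float = 1.0) -> int:
    """Map the counts for one grid square to a 0-255 heat intensity."""
    # Assumed weighting: a distinct photographer counts as much as five photos,
    # so that one prolific uploader does not dominate a grid square.
    combined = photos + 5 * photographers
    # Square-root-like compression; layer_scale stands in for the
    # per-layer adjustments mentioned in the text.
    value = layer_scale * math.sqrt(combined)
    return min(255, int(value))
```

The square root keeps sparsely photographed areas visible while preventing the busiest squares from saturating the whole colour scale.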

The next step was to create a popularity index for geotagged wikipedia articles, excluding the ones with clearly non-geographical content, like people. The geotagged articles are obtained from the dbpedia project and the popularity data for wikipedia is obtained from the wikipedia log files: we used two full days of logfiles, one in summer and one in winter. The popularity rank of the articles is further modified by the type and additional properties of the article as given by dbpedia: for example, if an article concerns a world heritage site, it receives a strong additional bonus.
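The ranking step can be sketched roughly as follows. The type names, the bonus multipliers and the simple averaging of the two logfile days are hypothetical choices for the example; only the general idea (page views adjusted by dbpedia type, with people excluded and world heritage sites boosted) comes from the description above.

```python
# Hypothetical bonus multipliers per dbpedia type; the real values are not published here.
TYPE_BONUS = {"world_heritage_site": 3.0, "castle": 1.5}
# Types treated as clearly non-geographical.
EXCLUDED_TYPES = {"person"}

def article_score(summer_views: int, winter_views: int, types: list[str]):
    """Return a popularity score for a geotagged article, or None if it is excluded."""
    if EXCLUDED_TYPES & set(types):
        return None
    # Average the two sampled days (one in summer, one in winter).
    base = (summer_views + winter_views) / 2
    # Apply the strongest bonus among the article's types (1.0 means no bonus).
    bonus = max((TYPE_BONUS.get(t, 1.0) for t in types), default=1.0)
    return base * bonus
```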

The combination of the two datasets - heatmap and wikipedia - uses an algorithm which looks for the most highly ranked wikipedia articles geotagged around the top heatmap spots for each subgrid on each layer. First we cluster the heatmap dots to avoid showing lots of markers very close to each other. Then we look for the most popular wikipedia articles near the hotspots: the higher a heatmap spot is ranked, the larger the area we search. If nothing is found, or the article found has a much lower popularity than the heatmap spot, we do not attach anything to the hotspot. Otherwise we connect the hotspot to the wikipedia article plus the corresponding wikitravel article, if available.
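The matching step for a single hotspot might look like the sketch below. It assumes that heat and article scores have been normalised to the range 0 to 1, and the radius parameters and the popularity ratio threshold are invented for illustration. The clustering step is left out.

```python
import math
from dataclasses import dataclass

@dataclass
class Hotspot:
    lat: float
    lon: float
    heat: float  # assumed normalised to 0..1

@dataclass
class Article:
    title: str
    lat: float
    lon: float
    score: float  # assumed normalised to 0..1

def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance by the haversine formula."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def attach_article(hotspot, articles, base_radius_km=1.0, max_extra_km=9.0, min_ratio=0.2):
    """Return the best article for the hotspot, or None if nothing suitable is nearby."""
    # The hotter the spot, the larger the search area (hypothetical linear rule).
    radius = base_radius_km + max_extra_km * hotspot.heat
    nearby = [a for a in articles
              if distance_km(hotspot.lat, hotspot.lon, a.lat, a.lon) <= radius]
    if not nearby:
        return None
    best = max(nearby, key=lambda a: a.score)
    # Skip articles that are much less popular than the hotspot itself.
    if best.score < min_ratio * hotspot.heat:
        return None
    return best
```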

Knowing the highest-ranked wikipedia article for an area helps to google for more: each marker additionally gives a direct search link for the article title.

In addition to the six world-covering layers of successively smaller grid steps with higher resolutions, we created separate ultra-high-res heatmaps for the 15000 top hotspots, most of them cities. The resolution of these high-res heatmaps depends on the popularity rank of the hotspot: the more photos, the higher the resolution, up to street level for the top 500.

The ultra-high-res heatmaps are then populated with the combined wikipedia and foursquare markers for the top spots in each heatmap, using an algorithm which first tries to associate wikipedia articles and foursquare locations with the most popular places on the map, and finally merges the top wikipedia articles and foursquare locations into the mix, even if they are not located near a visually attractive spot. The foursquare data is obtained via the public api only for the areas surrounding the hotspots, unlike the panoramio and wikipedia data, which are obtained for the whole world.
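The final merging step could be sketched like this. The function name, the dictionary of scores and the `extra` limit are hypothetical; the sketch only shows the idea of keeping the places already attached to hotspots and then adding the top-ranked remaining places regardless of where they lie.

```python
def select_markers(attached: list[str], all_places: dict[str, float], extra: int = 3) -> list[str]:
    """Combine hotspot-attached places with the top-ranked remaining places.

    attached   -- names of places already associated with popular map spots
    all_places -- every candidate place in the area, mapped to its popularity score
    extra      -- how many unattached top places to merge in (assumed limit)
    """
    chosen = list(attached)
    seen = set(chosen)
    # Rank the leftovers by score, best first, and merge in the top few
    # even though they are not near a visually hot spot.
    leftovers = sorted((name for name in all_places if name not in seen),
                       key=lambda name: all_places[name], reverse=True)
    chosen.extend(leftovers[:extra])
    return chosen
```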

Again, we exclude both geotagged wikipedia articles and foursquare locations of an obviously non-geographic or non-sightseeing type. We add bonuses to articles and locations based on the suitability of their type: for example, castles, churches and public squares get different bonuses.

The end result is a large set of overlay png tiles for each resolution along with their associated json datafiles containing wikipedia/wikitravel/foursquare places, ranks and types.

When showing the map in the browser, we use javascript to calculate the necessary tiles and json datafiles each time the user's viewport changes, either by panning or zooming. The new files are loaded, the visible parts of the tiles are presented and the top location data objects are shown as markers, color-coded based on their relative popularity within the visible area. The default view of the inside-city markers is geared towards sightseeing. If the user chooses to see places to eat, drink or sleep instead, we show only the foursquare locations of the corresponding types, ranked purely by the total number of visitors to the foursquare location. Since all the map layers and datafiles are precomputed, the whole application runs in the browser with nothing done on the server side except serving static files.
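To illustrate how the set of files for a viewport can be computed entirely on the client, here is a sketch in Python (the site itself does this in javascript). It assumes the standard Web Mercator tile numbering used by common web maps and a hypothetical file naming pattern; the actual grid and file layout of the site may differ, and viewports crossing the 180th meridian are not handled.

```python
import math

def tile_index(lat: float, lon: float, zoom: int) -> tuple[int, int]:
    """Return the (x, y) tile containing a point, in Web Mercator tile numbering."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    # Clamp to the valid range for points on the map edge.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)

def files_for_viewport(north: float, south: float, west: float, east: float, zoom: int) -> list[str]:
    """List the overlay tile and datafile names needed to cover a viewport."""
    x_min, y_min = tile_index(north, west, zoom)  # top-left tile
    x_max, y_max = tile_index(south, east, zoom)  # bottom-right tile
    names = []
    for x in range(x_min, x_max + 1):
        for y in range(y_min, y_max + 1):
            # Hypothetical naming scheme: one png overlay and one json datafile per tile.
            names.append(f"tiles/{zoom}/{x}/{y}.png")
            names.append(f"tiles/{zoom}/{x}/{y}.json")
    return names
```

Because the file names follow directly from the viewport, a static file server is enough to answer every request.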