We are interested in seeing what social media data can tell us about society at large. More concretely, we want to see whether we can develop tools to take the digital pulse of a city. For example, where do different people go to shop? When and where do people go to exercise? The current demo is one small step in that direction: it gives users the opportunity to explore the data and see life in a city with new eyes.
After you make a particular selection, e.g. selecting only the language "Russian" with all topics, days and times, certain areas on the map light up in red. Red areas indicate "hot spots" with more activity than expected for the selected filters. For example, the area around Brighton Beach lights up in red because, out of the tweets matching the filter settings, an unusually large fraction falls in this area when compared to the background distribution of all tweets. We observe similar results for other languages, e.g. selecting Italian lights up parts of Little Italy. "Cold spots" with less activity than expected are not marked on the map to avoid clutter. However, filters marked in cyan in the right sidebar indicate that, say, for the topic "Transport", there is less activity than expected during the night. When activity is "as expected", or there is not enough data to tell, we leave the corresponding area blank.
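The comparison against the background distribution can be illustrated with a small sketch. The demo's actual statistical test is not described here, so everything specific below is an assumption made for illustration: the function name `classify_area`, the normal approximation to a binomial count, the z-score threshold of 3, and the minimum of 50 matching tweets.

```python
import math


def classify_area(observed, filtered_total, background_area, background_total,
                  z_threshold=3.0, min_tweets=50):
    """Label one map area as 'hot', 'cold', or None (left blank).

    observed         -- tweets in this area that match the current filters
    filtered_total   -- tweets matching the current filters, city-wide
    background_area  -- all tweets in this area, regardless of filters
    background_total -- all tweets, city-wide

    The thresholds are illustrative assumptions, not the demo's values.
    """
    # Too few matching tweets: not enough data to tell, so leave blank.
    if filtered_total < min_tweets or background_total == 0:
        return None

    # Share of all tweets that fall in this area (background distribution).
    p = background_area / background_total
    expected = filtered_total * p

    # Standard deviation of a binomial count with success probability p.
    std = math.sqrt(filtered_total * p * (1.0 - p))
    if std == 0:
        return None

    z = (observed - expected) / std
    if z >= z_threshold:
        return "hot"
    if z <= -z_threshold:
        return "cold"
    return None  # activity is roughly as expected
```

With made-up numbers, an area that holds 1% of all tweets but 15% of the tweets matching a language filter would be labeled "hot", while an area whose share matches the background would stay blank.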
For this demo we use public, geo-tagged tweets obtained through the Twitter APIs, using both the REST API and the Streaming API. In total, we have collected 4.8 million public tweets from 254,495 distinct users. Most of the data comes from January to November 2013. Only tweets that are geo-tagged with (latitude, longitude) coordinates are considered, and tweets that lie outside of New York County (Manhattan), Kings County (Brooklyn), The Bronx, Staten Island and Queens County are ignored. We will periodically add fresh data in the future.
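As a rough illustration of the geographic filter described above, the sketch below keeps only tweets that carry a point coordinate inside a bounding box around New York City. It makes several assumptions: tweets are dictionaries in the Twitter API v1.1 style, where `coordinates` holds a GeoJSON point in (longitude, latitude) order; the box corners are approximate; and a rectangle stands in for the borough boundaries, which a real filter would check with polygons. The helper name `is_in_nyc_box` is hypothetical.

```python
# Approximate bounding box around the five boroughs (assumed values).
NYC_MIN_LAT, NYC_MAX_LAT = 40.4774, 40.9176
NYC_MIN_LON, NYC_MAX_LON = -74.2591, -73.7004


def is_in_nyc_box(tweet):
    """Return True if the tweet has a point coordinate inside the NYC box."""
    geo = tweet.get("coordinates")
    if not geo or geo.get("type") != "Point":
        return False  # not geo-tagged with an exact location
    lon, lat = geo["coordinates"]  # GeoJSON order: longitude first
    return (NYC_MIN_LAT <= lat <= NYC_MAX_LAT
            and NYC_MIN_LON <= lon <= NYC_MAX_LON)


tweets = [
    {"text": "midtown", "coordinates": {"type": "Point", "coordinates": [-73.9857, 40.7484]}},
    {"text": "no location", "coordinates": None},
    {"text": "london", "coordinates": {"type": "Point", "coordinates": [-0.1276, 51.5072]}},
]
kept = [t for t in tweets if is_in_nyc_box(t)]
```

A bounding box also admits points in neighboring New Jersey and Nassau County, so it is only a first-pass filter.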
Currently, we detect topics using hand-compiled dictionaries for English, Spanish, Italian, French, Russian, Chinese and Arabic, the most common languages in tweets from NYC. For example, any tweet containing "swimming" would be marked as sports-related. Obviously, there are false positives such as "Things are going swimmingly" and false negatives such as "I was doing 200m of butterfly this morning". In the future we plan to improve the topic detection by integrating statistical language models such as Latent Dirichlet Allocation.
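A dictionary-based topic tagger of this kind can be sketched in a few lines. The keyword lists and the name `detect_topics` below are made up for illustration and are not the demo's dictionaries; the sketch also assumes plain substring matching on lowercased text, which reproduces both failure cases mentioned above.

```python
# Hypothetical, tiny stand-ins for the hand-compiled dictionaries.
TOPIC_KEYWORDS = {
    "sports": ["swimming", "jogging", "yoga"],
    "transport": ["subway", "taxi", "traffic"],
}


def detect_topics(text):
    """Return the set of topics whose keywords occur in the text."""
    lowered = text.lower()
    return {
        topic
        for topic, keywords in TOPIC_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    }


# False positive: "swimmingly" contains the keyword "swimming".
print(detect_topics("Things are going swimmingly"))  # {'sports'}

# False negative: no dictionary keyword appears in the tweet.
print(detect_topics("I was doing 200m of butterfly this morning"))  # set()
```

Matching on whole tokens instead of substrings would remove the "swimmingly" false positive, but it would not help with the false negative, which needs either a larger dictionary or a statistical model.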
We use a publicly available tool for language detection. Note that detecting the language of very short tweets, or of tweets that contain only a URL, is difficult, and the tool will sometimes make mistakes. Also, Arabizi and other transliterations of non-Latin scripts are currently not supported by this tool.
Some parts of Downtown Manhattan have a sufficiently large number of geo-tagged tweets to aggregate statistics for small geographic areas. In other parts, the tweet volume is much lower. To still have sufficient data for statistical reasoning there, we aggregate data over a larger area. The exact recursive splitting algorithm uses quadtrees.
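The idea of small cells in dense areas and large cells in sparse areas can be sketched with a basic point quadtree. This is not the demo's exact algorithm: the stopping rule (at most `max_points` tweets per cell, at most `max_depth` levels), the default values, and the name `build_quadtree` are assumptions chosen for illustration.

```python
def build_quadtree(points, bounds, max_points=4, max_depth=8, depth=0):
    """Recursively split a rectangle into quadrants while it holds too many points.

    points -- list of (x, y) tuples, e.g. (longitude, latitude)
    bounds -- (min_x, min_y, max_x, max_y) of the current cell
    Returns a list of (bounds, point_count) tuples, one per leaf cell.
    """
    # Stop splitting when the cell is sparse enough or the tree is deep enough.
    if len(points) <= max_points or depth >= max_depth:
        return [(bounds, len(points))]

    min_x, min_y, max_x, max_y = bounds
    mid_x = (min_x + max_x) / 2.0
    mid_y = (min_y + max_y) / 2.0
    quadrants = [
        (min_x, min_y, mid_x, mid_y),  # lower-left
        (mid_x, min_y, max_x, mid_y),  # lower-right
        (min_x, mid_y, mid_x, max_y),  # upper-left
        (mid_x, mid_y, max_x, max_y),  # upper-right
    ]

    # Assign every point to exactly one quadrant.
    buckets = [[] for _ in quadrants]
    for x, y in points:
        index = (1 if x >= mid_x else 0) + (2 if y >= mid_y else 0)
        buckets[index].append((x, y))

    leaves = []
    for quadrant, bucket in zip(quadrants, buckets):
        leaves.extend(
            build_quadtree(bucket, quadrant, max_points, max_depth, depth + 1)
        )
    return leaves


# Ten points clustered in one corner plus one isolated point.
sample = [(i * 0.01, i * 0.01) for i in range(1, 11)] + [(0.9, 0.9)]
cells = build_quadtree(sample, (0.0, 0.0, 1.0, 1.0))
```

In this toy example the dense corner is split several times into small cells, while the isolated point stays in a single large quadrant, which mirrors the contrast between Downtown Manhattan and lower-volume parts of the city.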
The lead scientist is Ingmar Weber (@ingmarweber) in QCRI's Social Computing group. Most of the implementation is done by Kiran Garimella (@gvrkiran).