So far I have found my BirdWatch application nice to look at but not terribly useful as an original way of finding information. Let’s face it: the vast majority of tweets are not worth reading. But some of them are highly relevant. What are their characteristics? At the most basic level, they come from people with huge numbers of followers and/or have been retweeted a lot. These are the tweets that reach a large audience, not the ones from users with low follower counts, although the latter make up the majority of the chatter. How do we find the more relevant tweets within an observation period?
I am running a private instance of this application which is listening to tweets on US politics. In this instance of the application I have been increasingly annoyed by an overwhelming amount of irrelevancy. I’d search for “Obama Syria” and get shiploads of tweets from crazies; finding the relevant stuff was next to impossible when I only had the result set sorted by time.
Crossfilter to the rescue. Over the weekend I finally had time to integrate it into the project. Now you will be able to sort tweets not only in natural order (by time) but also by the number of followers of the author or the number of times a particular tweet has been retweeted. As usual you can try this out.
The retweets sort order currently evaluates the total number of retweets over the entire lifetime of a tweet. This biases the order towards older tweets that were retweeted a lot in the past but not necessarily proportionately often during the observation period, which is the time span between now (searches are live, so that is whenever you look at the page) and the oldest tweet in the data set. An additional metric could be the number of retweets of a tweet during the observation period rather than the total. That should not be all that difficult using Crossfilter.
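Here is a rough sketch of what such a metric could look like. It is not part of the application; `retweetsInWindow` is a hypothetical name, and the sketch assumes that every retweet in memory embeds the original tweet as `retweeted_status` with an `id_str` and a cumulative `retweet_count`, as in Twitter's tweet JSON. The difference between the highest and lowest count seen for the same original tweet serves as an estimate of the retweets during the observation period.

```javascript
// Estimate how often each original tweet was retweeted within the data set in memory.
// Assumption: a retweet carries the original tweet in `retweeted_status`, including the
// cumulative `retweet_count` at the time the retweet was captured.
function retweetsInWindow(tweets) {
  var seen = {}; // original tweet id -> { min, max } of the observed retweet counts

  tweets.forEach(function (t) {
    if (!t.retweeted_status) { return; } // not a retweet, nothing to count
    var id = t.retweeted_status.id_str;
    var count = t.retweeted_status.retweet_count;
    if (!seen[id]) {
      seen[id] = { min: count, max: count };
    } else {
      seen[id].min = Math.min(seen[id].min, count);
      seen[id].max = Math.max(seen[id].max, count);
    }
  });

  var result = {}; // original tweet id -> estimated retweets in the window
  Object.keys(seen).forEach(function (id) {
    result[id] = seen[id].max - seen[id].min;
  });
  return result;
}
```

A tweet that shows up only once as a retweet gets an estimate of 0 here, so this is at best a lower bound; it would need refinement before driving a sort order.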
Let’s have a look at the source code. The Crossfilter object lives in an AngularJS service, which is a singleton within the application. The functionality is then exposed through exported functions for adding data, clearing the crossfilter and retrieving items for the paginated tweets page.
Depending on the selected sort order, different dimensions are used to generate the paginated tweets list. Sorting by time of tweeting is achieved with a dimension that sorts by tweet ID (tweet IDs are in chronological order). Another dimension sorts tweets by the follower count of the tweet author. In this case, the maxRetweets mapper function looks up all retweets of a tweet within the data set in memory and sets its retweet count to the highest value found. The tweets with the highest number of retweets are found using the retweets dimension. This dimension returns multiple versions of the same original tweet when the tweet has been retweeted multiple times during the observation period. The _.uniq function from underscore.js is used to filter out those duplicate entries. Because the array returned from the dimension is in descending order of retweet_count, the version of a retweet with the highest retweet count is found first and retained.
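The deduplication step can be illustrated without any libraries. The following is a minimal sketch, not the application's code: the helpers `originalId`, `retweetCount` and `topRetweeted` are hypothetical, a plain sort stands in for the retweets dimension, a `filter` stands in for `_.uniq`, and the tweet fields (`id_str`, `retweeted_status`, `retweet_count`) are assumed to follow Twitter's tweet JSON.

```javascript
// Id of the original tweet: for a retweet this is the embedded tweet's id,
// otherwise the tweet's own id. (Hypothetical helper.)
function originalId(t) {
  return t.retweeted_status ? t.retweeted_status.id_str : t.id_str;
}

// Retweet count used for ordering; tweets that are not retweets count as 0 in this sketch.
function retweetCount(t) {
  return t.retweeted_status ? t.retweeted_status.retweet_count : 0;
}

// Returns up to n tweets, highest retweet count first, one entry per original tweet.
function topRetweeted(tweets, n) {
  // Sort a copy in descending order, as a stand-in for the retweets dimension.
  var sorted = tweets.slice().sort(function (a, b) {
    return retweetCount(b) - retweetCount(a);
  });

  // Keep only the first occurrence per original tweet. Because of the descending
  // order, that is the version with the highest retweet count.
  var seen = {};
  var unique = sorted.filter(function (t) {
    var id = originalId(t);
    if (seen[id]) { return false; }
    seen[id] = true;
    return true;
  });

  return unique.slice(0, n);
}
```

The sort has to happen before the duplicates are dropped; deduplicating first would keep whichever version happened to arrive first, not the one with the highest count.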
The paginated data is generated by retrieving all items from the selected dimension up to the current page. The _.rest function from underscore.js then drops the items for all pages that come before the current page.
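As a small illustration of that paging logic, here is a dependency-free sketch. `tweetPageSketch` is a hypothetical name, pages are assumed to be 1-based, and the two `slice` calls stand in for fetching the first items from a dimension and for `_.rest(list, index)` respectively.

```javascript
// "Take everything up to the current page, then drop the earlier pages."
// sortedTweets is assumed to be already in the selected sort order.
function tweetPageSketch(sortedTweets, currentPage, pageSize) {
  // Everything up to and including the current page.
  var upToCurrentPage = sortedTweets.slice(0, currentPage * pageSize);
  // Drop the items that belong to the pages before the current one.
  return upToCurrentPage.slice((currentPage - 1) * pageSize);
}
```

With ten items and a page size of three, page 2 holds items four to six and page 4 holds only the tenth item.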
AngularJS then takes care of rendering a view by calling the tweetPage function from the crossfilter service every time the UI is updated. This means that the visual representation of the data is always up to date, with automatic updates for example when a tweet in the followers order is retweeted again. All that without having to manipulate the DOM directly, thanks to AngularJS.
Evaluating the crossfilter dimension functions again and again can become a problem, though, when tens of tweets per second arrive through the Server-Sent Events (SSE) connection with the server. To avoid evaluating them multiple times per second, I use _.throttle in the registerCallback function in controllers.js.
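To show the idea behind throttling, here is a minimal stand-in written in plain JavaScript. It is not underscore's implementation and not the code from controllers.js: `throttleSketch` is a hypothetical name, the optional `now` parameter exists only to make the behaviour testable, and unlike `_.throttle` with its default options this sketch simply drops calls inside the wait window instead of scheduling a trailing call.

```javascript
// Wraps fn so that it runs at most once per waitMs milliseconds.
// Calls arriving inside the wait window are ignored (leading-edge throttle only).
function throttleSketch(fn, waitMs, now) {
  now = now || Date.now;   // clock function, injectable for testing
  var lastRun = -Infinity; // time of the last call that actually ran

  return function () {
    var t = now();
    if (t - lastRun < waitMs) { return; } // too soon, skip this call
    lastRun = t;
    return fn.apply(this, arguments);
  };
}

// Usage sketch: refresh the tweet list at most once per second while tweets stream in.
// var onNewTweets = throttleSketch(function () { /* re-read the current page */ }, 1000);
```

Because dropped calls are never replayed, the last tweets of a burst would only show up with the next update; a trailing call, as `_.throttle` provides, avoids that.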
By the way, you can now increase the number of pre-loaded tweets to up to 20,000 under settings. That may slow the application down, though. A lot of things aren’t perfect yet, but overall it seems to be working fine.
Anyhow, I will go into more detail later. The source code for the entire application can be found on GitHub. My previous article is the place to go for an explanation of the overall architecture of the application. It is a work in progress and I will get back to it in the next couple of days. For now I just wanted to give you a quick update on what I have been up to this weekend.
Until next time, Matthias