Matthias Nehlsen

Software, Data and Stuff

Using Crossfilter with AngularJS

So far I have found my BirdWatch application nice to look at but not terribly useful as an original way of finding information. Let’s face it - the vast majority of tweets are not terribly useful. But there are some in there that are highly relevant. What are their characteristics? At the most basic level, they come from people with huge numbers of followers and / or have been re-tweeted a lot. It’s these tweets that have a large audience, not the ones from users with low follower counts. The latter make up the majority of the chatter, though. How do we find these more relevant tweets within an observation period?

I am running a private instance of this application which is listening to tweets on US politics. In this instance of the application I have been increasingly annoyed by an overwhelming amount of irrelevancy. I’d search for “Obama Syria” and get shiploads of tweets from crazies; finding the relevant stuff was next to impossible when I only had the result set sorted by time.

Crossfilter to the rescue. Over the weekend I finally had time to integrate it into the project. Now you can sort tweets not only in natural order (by time) but also by the number of followers of the author or by the number of times a particular tweet has been retweeted. As usual you can try this out.

The retweets sort order currently evaluates the total number of retweets over the entire lifetime of a tweet. This biases the order towards older tweets that were retweeted a lot in the past but not necessarily proportionately often during the observation period, which is the time span between now (whenever you are looking at the page, as searches are live) and the oldest tweet in the data set. An additional metric could be the number of retweets of a tweet during the observation period rather than the total number. That should not be all that difficult using crossfilter.
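To make the idea of counting retweets within the observation period concrete, here is a minimal sketch in plain JavaScript. It is not part of BirdWatch: the sample tweets and the function names `retweetsInWindow` and `rankByWindowRetweets` are made up for illustration. With crossfilter, something similar could presumably be done with a group on the original-ID dimension, but the sketch avoids the library so that it runs on its own.

```javascript
// Hypothetical sample data: three retweets held in memory, two of them of the same original.
var tweets = [
    { id: 11, retweeted_status: { id: 1, retweet_count: 500 } },
    { id: 12, retweeted_status: { id: 2, retweet_count: 40 } },
    { id: 13, retweeted_status: { id: 1, retweet_count: 502 } },
    { id: 14 }                                   // an original tweet, not a retweet
];

// Count how many retweets of each original tweet are present in the data set.
function retweetsInWindow(tweets) {
    var counts = {};
    tweets.forEach(function(t) {
        if (t.hasOwnProperty("retweeted_status")) {
            var originalId = t.retweeted_status.id;
            counts[originalId] = (counts[originalId] || 0) + 1;
        }
    });
    return counts;
}

// Turn the counts into a list sorted by descending count.
// Note: object keys are strings, so originalId is a string here.
function rankByWindowRetweets(tweets) {
    var counts = retweetsInWindow(tweets);
    return Object.keys(counts)
        .map(function(id) { return { originalId: id, count: counts[id] }; })
        .sort(function(a, b) { return b.count - a.count; });
}

var ranked = rankByWindowRetweets(tweets);
```

In this made-up data set, the original with ID 1 has the higher total retweet_count and also the most retweets inside the window; in real data the two rankings can differ, which is the point of the additional metric.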

Let’s have a look at the source code. The Crossfilter object lives in an AngularJS service, which is a singleton within the application. The functionality is then exposed through exported functions for adding data, clearing the crossfilter and retrieving items for the paginated tweets page.

Crossfilter service (crossfilter.js)
'use strict';

// crossfilter service
angular.module('birdwatch.services').service('cf', function (utils) {
    var exports = {};

    // crossfilter object: browser side analytics library, holds array type data (w/incremental updates).
    // dimensions are fast queries on data, e.g. view sorted by followers_count or retweet_count of the original message
    var cf = crossfilter([]);
    var tweetIdDim   = cf.dimension(function(t) { return t.id; });
    var followersDim = cf.dimension(function(t) { return t.user.followers_count; });
    var retweetsDim  = cf.dimension(function(t) {
        if (t.hasOwnProperty("retweeted_status")) { return t.retweeted_status.retweet_count; }
        else return 0;
    });
    var originalIdDim  = cf.dimension(function(t) {
        if (t.hasOwnProperty("retweeted_status")) { return t.retweeted_status.id; }
        else return 0;
    });

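    // freeze imposes a filter on the crossfilter that only shows tweets up to and including the latest
    // tweet at the time of calling freeze; unfreeze clears the filter again. A filter function is used
    // because crossfilter range filters are half-open ([lo, hi)) and would exclude the latest tweet.
    exports.freeze    = function() {
        var latest = tweetIdDim.top(1)[0];
        if (!latest) { return; }                                     // nothing to freeze when empty
        tweetIdDim.filter(function(id) { return id <= latest.id; });
    };
    exports.unfreeze  = function() { tweetIdDim.filterAll(); };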

    exports.add       = function(data)     { cf.add(data); };                            // add new items, as array
    exports.clear     = function()         { cf.remove(); };                             // reset crossfilter
    exports.noItems   = function()         { return cf.size(); };                        // crossfilter size total
    exports.numPages  = function(pageSize) { return Math.ceil(cf.size() / pageSize); };  // number of pages

    // predicates
    var retweeted     = function(t) { return t.hasOwnProperty("retweeted_status"); };

    // mapper functions
    var originalTweet = function(t) { return utils.formatTweet(t.retweeted_status); };   // returns original tweet
    var tweetId       = function(t) { return t.id; };                                    // returns tweet id
    var retweetCount  = function(t) { if (retweeted(t)) { return t.retweeted_status.retweet_count; } else return 0 };
    var maxRetweets   = function(t) {
        t.retweet_count = retweetCount(_.max(originalIdDim.filter(t.id).top(1000),
            function(t){ return t.retweeted_status.retweet_count; }));
        originalIdDim.filterAll();
        return t;
    };

    // deliver tweets for current page. fetches all tweets up to the current page,
    // throws tweets for previous pages away.
    exports.tweetPage = function(currentPage, pageSize, order, live) {
        return _.rest(fetchTweets(currentPage * pageSize, order), (currentPage - 1) * pageSize);
    };

    // fetch tweets from crossfilter dimension associated with particular sort order up to the current page,
    // potentially mapped and filtered
    var fetchTweets = function(pageSize, order) {
      if      (order === "latest")    { return tweetIdDim.top(pageSize); }    // latest: desc order of tweets by ID
      else if (order === "followers") {
          return followersDim.top(pageSize).map(maxRetweets);
      }   // desc order of tweets by followers
      else if (order === "retweets") {  // descending order of tweets by total retweets of original message
          return _.first(               // filtered to be unique, would appear for each retweet in window otherwise
              _.uniq(retweetsDim.top(cf.size()).filter(retweeted).map(originalTweet), false, tweetId), pageSize);
      }
      else { return []; }
    };

    return exports;
});

Depending on the selected sort order, different dimensions are used to generate the paginated tweets list. Sorting by time of tweeting is achieved with a dimension that sorts by tweet ID (IDs are in chronological order). Another dimension sorts tweets by the follower count of the tweet author. In this case, the mapper function maxRetweets looks up the retweets of each tweet within the data set in memory (up to 1,000 per tweet) and sets the retweet count to the highest value found. The tweets with the highest number of retweets are found using the retweets dimension. Within this dimension, multiple versions of the same original tweet are returned when the tweet has been retweeted multiple times during the observation period. The _.uniq function from underscore.js is used to filter out those duplicate entries. Because the dimension returns its items in descending order of retweet_count, the version of a retweet with the highest retweet count is found first and retained.
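The deduplication step can be illustrated without any library. The following is only a sketch: `uniqBy` is a hypothetical stand-in for `_.uniq(array, false, iteratee)`, and the sample data is invented. It shows why the descending order matters: the first item seen for each ID is the one that survives.

```javascript
// Hypothetical stand-in for _.uniq(array, false, iteratee): keeps the first item seen for each key.
function uniqBy(items, keyFn) {
    var seen = {};
    return items.filter(function(item) {
        var key = keyFn(item);
        if (seen.hasOwnProperty(key)) { return false; }   // duplicate key: drop it
        seen[key] = true;
        return true;
    });
}

// Made-up originals as they might come out of the retweets dimension: sorted by
// descending retweet_count, with the same original (id 1) appearing twice.
var originals = [
    { id: 1, retweet_count: 502 },
    { id: 1, retweet_count: 500 },
    { id: 2, retweet_count: 40 }
];

var unique = uniqBy(originals, function(t) { return t.id; });
```

Here `unique` contains one entry per original tweet, and for ID 1 it is the version with the count of 502 because that one came first in the sorted input.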

The paginated data is generated by retrieving all items from the selected dimension up to the current page. The _.rest function from underscore.js then drops the items for all pages that come before the current page.
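The page arithmetic is easy to check with a plain array. This is a sketch under stated assumptions: `topK` is a hypothetical stand-in for a dimension's `top(k)` call on already sorted data, `dropFirst` mirrors what `_.rest(items, n)` does, and `pageOf` is an illustrative name, not a function from the application.

```javascript
// Hypothetical stand-in for dimension.top(k): the first k items of an already sorted array.
function topK(sorted, k) { return sorted.slice(0, k); }

// Plain-array equivalent of _.rest(items, n): drop the first n items.
function dropFirst(items, n) { return items.slice(n); }

// Fetch everything up to and including the current page, then discard the earlier pages.
function pageOf(sorted, currentPage, pageSize) {
    var upToCurrentPage = topK(sorted, currentPage * pageSize);
    return dropFirst(upToCurrentPage, (currentPage - 1) * pageSize);
}

var sortedIds  = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1];
var secondPage = pageOf(sortedIds, 2, 3);   // fetches 6 items, drops the first 3
var lastPage   = pageOf(sortedIds, 4, 3);   // a partial final page
```

With a page size of 3, `secondPage` is `[7, 6, 5]` and `lastPage` is `[1]`. The cost of this approach grows with the page number, since every page before the current one is fetched and then thrown away.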

AngularJS then takes care of rendering a view by calling the tweetPage function from the crossfilter service every time the UI is updated. This means that the visual representation of the data is always up to date, with automatic updates for example when a tweet in the followers order is retweeted again. All that without having to manipulate the DOM directly, thanks to AngularJS.

Evaluating the crossfilter dimension functions again and again can be problematic, though, when tens of individual tweets per second arrive through the Server-Sent Events (SSE) connection with the server. In order to avoid evaluating the crossfilter functions multiple times per second I use _.throttle in the registerCallback function in controllers.js:

Insertion cache inside controller (controllers.js)
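// set up once, outside the per-item callback, so that all calls share one throttle timer:
var flushInsertionCache = _.throttle(function() {  // throttle because every insertion triggers expensive
    $scope.wordCount.insert(insertionCache);       // $scope.$apply(), insert cache at most once every
    insertionCache = [];                           // 3 seconds, then empty cache.
}, 3000);

// inside the callback, for every received item t:
insertionCache = insertionCache.concat(t);         // every received item is appended to insertionCache,
flushInsertionCache();                             // then a throttled flush is requested.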
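One detail worth keeping in mind with _.throttle: it returns a new function that carries its own timing state, so throttling only takes effect when the same returned function is called repeatedly. The sketch below illustrates this with `throttleLeading`, a simplified and hypothetical stand-in (leading edge only, whereas the real _.throttle can also schedule a trailing call) that takes an injected clock so that no real timers are needed.

```javascript
// Simplified, hypothetical stand-in for _.throttle (leading edge only).
// `now` is injected so the example does not depend on real time.
function throttleLeading(fn, wait, now) {
    var lastRun = -Infinity;                     // timing state lives in this closure
    return function() {
        if (now() - lastRun >= wait) {
            lastRun = now();
            fn();
        }
    };
}

var clock = 0;
function now() { return clock; }

var perCallRuns = 0;
var sharedRuns  = 0;
var shared = throttleLeading(function() { sharedRuns += 1; }, 3000, now);

// Ten events, 100 ms apart.
for (var i = 0; i < 10; i++) {
    clock = i * 100;
    // New throttled function per event: fresh state every time, so nothing is held back.
    throttleLeading(function() { perCallRuns += 1; }, 3000, now)();
    // One shared throttled function: only the first event within 3 seconds gets through.
    shared();
}
```

After the loop, `perCallRuns` is 10 while `sharedRuns` is 1, which is why the throttled function should be created once and then reused for every incoming item.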

By the way, you can now increase the number of pre-loaded tweets to up to 20,000 under settings. That may slow the application down, though. A lot of things aren’t perfect yet, but overall it seems to be working fine.

Anyhow, I will go into more detail later. The source code for the entire application can be found on GitHub. My previous article is the place to go for an explanation of the overall architecture of the application. It is a work in progress and I will get back to it in the next couple of days. For now I just wanted to give you a quick update on what I have been up to this weekend.

Until next time, Matthias
