Initially, I parsed the Tweets in the BirdWatch application into instances of a Tweet case class upon ingestion and then used that case class representation throughout, including for database persistence. I have since realized that this was not a good idea. A case class is very convenient for passing information around inside the application, but for persistence I argue that we cannot afford to be opinionated about what to keep and what to throw away. I fixed this together with the planned migration to ReactiveMongo 0.9 in the latest commits: each observable fact coming from the Twitter Streaming API is now stored in its entirety.
Any data model will almost inevitably turn out to be wrong in the future, because we cannot predict what we will want to analyze later. We can always change the data model at a later point and from then on store a different interpretation of the observable fact, but we would then lack complete historic information when we want to test our hypotheses on retrospective data. The solution is to store the Tweets in their complete JSON representation. MongoDB is a great choice for this because it allows indexing the data while leaving the JSON structure intact, giving us the best of both worlds: with this lossless persistence we can always reconstruct the observable fact from the database, while at the same time being able to search quickly through a potentially large dataset.
I also wanted to upgrade ReactiveMongo in order to fix a previous problem with KillCursors. Version 0.9 entails some changes in the API, so it made sense to tackle the upgrade and the Tweet persistence layer together. Let's go through some of the changes.
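As a rough sketch of the new setup (names and configuration details here are my illustration, not necessarily the exact code in the repository), the Play plugin for ReactiveMongo 0.9 hands out the database connection, from which a JSON-aware collection can be obtained:

```scala
import play.api.Play.current
import play.modules.reactivemongo.ReactiveMongoPlugin
import play.modules.reactivemongo.json.collection.JSONCollection

// Assumes the play-reactivemongo 0.9 plugin is installed and the MongoDB
// connection is configured in conf/application.conf.
// The collection name "tweets" is illustrative.
def db = ReactiveMongoPlugin.db
def tweetsColl: JSONCollection = db.collection[JSONCollection]("tweets")
```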
I have moved the Tweet collection plus the basic query and insert methods into a Tweet companion object, with the intention of turning this into a lightweight DAO (Data Access Object) for Tweets.
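A minimal sketch of what such a companion object might look like; the case class fields, method names, and collection name are assumptions for illustration, not the actual Tweet.scala:

```scala
import scala.concurrent.Future
import play.api.Play.current
import play.api.libs.json._
import play.api.libs.concurrent.Execution.Implicits._
import play.modules.reactivemongo.ReactiveMongoPlugin
import play.modules.reactivemongo.json.collection.JSONCollection
import reactivemongo.core.commands.LastError

// fields are illustrative; the real case class carries more
case class Tweet(text: String, screenName: String)

object Tweet {
  // the collection lives in the companion object, acting as a lightweight DAO
  def tweetsColl: JSONCollection =
    ReactiveMongoPlugin.db.collection[JSONCollection]("tweets")

  // persist the raw JSON exactly as it came off the wire
  def insertJson(json: JsObject): Future[LastError] = tweetsColl.insert(json)
}
```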
Storing the raw JSON from Twitter not only prevents us from throwing away data we might need in the future; it is also much simpler than dealing with implicit BSONReader and BSONWriter objects, as was previously necessary.
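With a JSON-capable collection (assumed here to be named tweetsColl, as in the sketches above), the store operation collapses to a single call; the parsing step is simplified for illustration:

```scala
import play.api.libs.json.{Json, JsObject}
import play.api.libs.concurrent.Execution.Implicits._

// chunk: a raw string received from the Twitter Streaming API (illustrative)
val json: JsObject = Json.parse(chunk).as[JsObject]

// one line, no BSONWriter to maintain: the JSON goes into MongoDB as-is
tweetsColl.insert(json)
```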
This is really all there is to storing JSON into MongoDB now. I don’t have to worry about additional fields or other changes in the Twitter Streaming API. If it is valid JSON, it will find its way into the database. Major changes to the API might break parsing into Tweets, but they would not break database persistence.
Error and status messages from Twitter also arrive as JSON, so they are stored as well.
Querying is more concise than before, using Json.obj instead of BSONDocuments.
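A sketch of such a query, assuming the JSONCollection from above is named tweetsColl: the "$exists" clause restricts matches to documents with a "text" field, and sorting on _id descending yields the latest Tweets first.

```scala
import scala.concurrent.Future
import play.api.libs.json._
import play.api.libs.concurrent.Execution.Implicits._

def jsonLatestN(n: Int): Future[List[JsValue]] =
  tweetsColl
    .find(Json.obj("text" -> Json.obj("$exists" -> true)))
    .sort(Json.obj("_id" -> -1))
    .cursor[JsValue]
    .toList(n) // upper bound keeps the in-memory list small
```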
Curly braces are replaced by Json.obj() and the colon by "->"; other than that, the syntax is very close to the shell. Note the "$exists" part: it limits the results to documents that have a "text" field, which in practice means Tweets (error and status messages could in principle contain one, but I have not encountered any that do).
The usage above, generating a List from the cursor, works fine for small n, but for larger result sets (say, hundreds of thousands of items) it would be a bad idea to build the entire list in memory first. Luckily, ReactiveMongo allows us to stream the results. That in itself is not new, but since version 0.9 we can limit the number of results, which makes this much more useful for a latestN scenario.
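Sketched with the same assumed collection name, the streaming variant builds an Enumerator instead of a List; I am assuming here that the upper bound is passed to enumerate, in line with the 0.9 limit described above:

```scala
import play.api.libs.iteratee.Enumerator
import play.api.libs.json._
import play.api.libs.concurrent.Execution.Implicits._

def tweetEnumerator(n: Int): Enumerator[JsValue] =
  tweetsColl
    .find(Json.obj("text" -> Json.obj("$exists" -> true)))
    .cursor[JsValue]
    .enumerate(n) // stop after at most n documents
```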
With this we create an Enumerator of JsObjects that streams the results into an Iteratee. Its usage is simple once we understand what the pattern means; my previous Iteratee article may help a little here.
This allows us to stream results into an Iteratee that will do whatever we need, in this case just a simple foreach.
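Consuming the Enumerator is then a single combinator; a sketch with a println-based Iteratee (the enumerator name follows the assumption above):

```scala
import play.api.libs.iteratee.Iteratee
import play.api.libs.json.JsValue
import play.api.libs.concurrent.Execution.Implicits._

// feed every streamed document into a simple foreach sink
val sink: Iteratee[JsValue, Unit] = Iteratee.foreach[JsValue] { json =>
  println(json) // illustrative; could be a WebSocket push, an aggregation, etc.
}

// |>>> feeds the enumerator into the iteratee and runs it
tweetEnumerator(500) |>>> sink
```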
I currently do not enumerate the results into an Iteratee, because the Tweets would appear in the wrong order in the UI: without an auto-incrementing counter in MongoDB I cannot easily reverse the direction of the enumeration, as I would need to know where to start enumerating in ascending order (at position [collection size - n]). But this is really a problem of the UI; the next versions will certainly make use of this pattern.
The only thing I was still missing was an easy way to get the size of a collection; in the shell, that would be db.tweets.count().
It turns out that in ReactiveMongo we can use the Count command for this, which returns a Future[Int] with the result (see Tweet.scala above). This allows us to act on the collection size in a non-blocking way once the Future completes.
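A sketch of that non-blocking count, assuming the database handle db from the plugin setup above; mapping over the returned Future runs the callback once the size arrives:

```scala
import reactivemongo.core.commands.Count
import play.api.libs.concurrent.Execution.Implicits._

// shell equivalent: db.tweets.count()
db.command(Count("tweets")) map { count =>
  println("tweets collection size: " + count)
}
```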
Great stuff; I really like ReactiveMongo. The documentation has also gotten a lot better in 0.9 compared to previous versions. Nonetheless, it still takes some source-code reading to find some of the good stuff. I'd be more than happy to help out here and contribute to the project documentation.