This is the follow-up on last week’s article about AngularJS and Play Framework. I want to share with you the problems that I encountered on the server side while running the demo as well as my ideas of how to deal with comparable problems more efficiently in the future. I have not encountered any AngularJS-related problems with the chat application so we won’t deal with it today. I’ll have more on AngularJS next week.
So, what was the problem?
The article was online and the demo was running. Load on the server was not very high, I saw up to 60 users connected at the same time, so really not a load substantial enough that could cause problems. Yet sometimes when I connected, the chat room would load, but messages from the Shakespeare-reciting actors would not be delivered. I first suspected that the actors had crashed and no proper supervision strategy was in place. But there was nothing in the logs. Most of the time, everything did work, and when things didn‘t, I just restarted the server, scratching my head as to what the problem could be. I couldn‘t reproduce the problem in my dev environment, so there was really only one option: debug the system running on the server. Thanks to Server Sent Events (SSE), restarting a server is not a big deal, the SSE connection will reconnect automatically. Sure, messages that occur between dropped connection and reconnect will be lost, but that‘s not a problem for my demo, and nothing that can‘t be solved if need be.
Still, it doesn‘t feel right to keep inserting the logging code, recompiling and restarting an application in a “production” environment that has users connected to it. After experimenting with log levels and putting println statements into the code and then observing the shell, I eventually noticed a connection timeout error and afterwards a flood of message deliveries to all connected clients. What was going on here?
There was really only one explanation that made any sense to me: occasionally one of the connected clients would not properly disconnect, maybe on a mobile connection, and then that connection would time out after a while. So far so good. But why would that hold up other clients? Could it be that Concurrent.broadcast really came to a complete halt when any one of the attached Enumeratee / Iteratee chains took longer than usual? Turns out the answer is yes, unless extra steps are taken. Let‘s look at a simple example. I recommend you fire up the Play / Scala REPL using play console and copy & paste the code below:
1 2 3 4 5 6 7 8 9 10 11
Above we create Enumerator and Channel through Concurrent.broadcast and attach two Iteratees, one of which occasionally puts its thread to sleep for 5 seconds. It holds up the other attached Iteratee as well. That’s not what I need here. How can we overcome this? By inserting a buffering Enumeratee:
1 2 3 4 5 6 7 8 9 10 11 12
Now the application behaves more like I would expect; individual Iteratees do not hold up everything any longer. Instead the buffering Enumeratee receives the messages, buffers them and frees up Concurrent.broadcast to call the next Iteratee with the current message. The buffer also drops messages when it is full.
Now after adding the buffering Enumeratee to the chat application, everything works just fine, as long as the individual buffers are large enough.
1 2 3 4 5 6
How can I handle a problem like this better the next time?
Unit testing would hardly have helped here, unless we tested for the scenario of an incorrectly closed the WS connection. Better knowledge about the timing of events through better logging would have helped immensely, though. Logging to files is not extremely useful when trying to find anomalies like the aforementioned timeouts and spikes directly thereafter; at least my eyes are not good at detecting this in plain text.
What I would like to have instead is a fully searchable log in a webclient. I am thinking about something like Kibana 3. With graphs as shown in the demo, the problem would have been immediately obvious. Long streaks of smooth delivery and then all of a sudden no messages dispatched to clients for two minutes and then the system catching up with a big spike.
So I am looking into combining Kibana 3 with a logging object in Play that receives all loggable events and which, depending on the environment (dev or production), by default either processes them further or not, with further configuration of the defaults for each logging type when specified. The logging object could then also expose an endpoint that allows switching individual loggers on or off during execution, without restarting the application. That way we could easily peek into a running instance in a production environment and watch what is going on, right now. I started working on this and I will present something here soon. Having such a logging system available will be a great help for all projects in the pipeline. I would like to shift my focus back to the BirdWatch project rather sooner than later, but first I want to have the right tools in place.