HBONow had a highly embarrassing outage on the night of the most anticipated Game of Thrones episode of the season. It’s the kind of failure that every developer who deals with capacity issues dreads, but it also suggests that the non-technical people at HBO headquarters may not understand their audience.
I’m pretty harsh on the corporate executives here, because this situation never should have occurred in the first place. Considering that there have been 8 prior episodes that never experienced this type of outage, one must wonder what exactly changed to cause it. Could it be a hacker? Or maybe added security after the various leaks of past episodes, which led to tighter restrictions in the code? Either way, whatever changed obviously wasn’t tested enough in preparation for a night like tonight.
As a developer with a few tools at my disposal, I decided to inspect the site using Firebug with full screen refreshes to see if I could figure out what was wrong. The primary issue that I could see is the API call to the following URL:
https://subscription-activation-api.hbonow.com/ws/subscription/flow/activation.status
Most of the responses were 500s (internal server errors), but I did get one 503, which showed up with:
503 Service Unavailable: Back-end server is at capacity
Very odd. So does this mean that the people in charge did not anticipate the traffic, even with the buildup over the past few weeks?
Also, what type of infrastructure is used behind HBONow? Is it an existing cloud-based service like Amazon? Or did they develop their own cloud-based infrastructure because they don’t trust third-party solutions?
After some time, a new URL popped up:
https://user.hbonow.com/v2/user/f1136005-9f00-4bdf-bddd-82fa923655ba-6943-3abfa982683cf092c5321b6eee34391819132db0/feature/entitled
Another 500 error resulted with the following message being returned:
{"code":"-100000","message":" [Unexpected Exception] [com.bamnetworks.registration.types.exception.UnexpectedRegistrationException] Caused by [java.util.concurrent.RejectedExecutionException: Task scala.concurrent.impl.Future$PromiseCompletingRunnable@4347dd51 rejected from java.util.concurrent.ThreadPoolExecutor@623e060c[Running, pool size = 2000, active threads = 2000, queued tasks = 5000, completed tasks = 178752]]"}
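This looks pretty bad (although interesting, considering that Scala and Java are both mentioned here). The big thing to note is the concurrency issue: all 2000 threads in the pool are active and the queue is sitting at 5000 tasks, presumably its limit, so new work is getting rejected outright. Maybe they need to up the queue limit and the number of threads? Then again, if something further downstream is slow, bigger limits may just delay the same failure.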
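To make that error message a little more concrete, here is a minimal Java sketch of how a ThreadPoolExecutor with a bounded queue ends up throwing RejectedExecutionException. This is not HBO's code: the class name, the method name, and the tiny sizes (2 threads, a queue of 5) are my own assumptions, scaled down from the 2000 and 5000 in the message, and the "slow" tasks simply block on a latch to stand in for whatever was actually holding the threads.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolSaturationDemo {

    // Hypothetical helper: submits `tasks` blocking jobs to a fixed-size pool
    // with a bounded queue and returns how many submissions were rejected.
    static int countRejections(int threads, int queueSize, int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize)); // default handler is AbortPolicy
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        try {
            for (int i = 0; i < tasks; i++) {
                try {
                    pool.execute(() -> {
                        try {
                            release.await(); // stand-in for a slow downstream call
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    // All threads busy and the queue is full: same failure as in the response above.
                    rejected++;
                }
            }
        } finally {
            release.countDown(); // let the blocked tasks finish
            pool.shutdown();     // accepted tasks still run to completion
        }
        return rejected;
    }

    public static void main(String[] args) {
        // 2 tasks occupy the threads, 5 wait in the queue, the remaining 3 are rejected.
        System.out.println(countRejections(2, 5, 10));
    }
}
```

Running it prints 3. Bumping the thread count or queue size only changes how many requests it takes to hit the wall, which is why I’m not convinced that bigger numbers alone would have saved them.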
Another interesting thing (at least in Firefox) is that the above URL also triggered a CORS security error:
Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at https://user.hbonow.com/v2/user/f1136005-9f00-4bdf-bddd-82fa923655ba-6943-3abfa982683cf092c5321b6eee34391819132db0/feature/entitled. (Reason: CORS header 'Access-Control-Allow-Origin' missing).
Yeah, those are fun to deal with. I’m surprised that they couldn’t just serve the user. subdomain’s endpoints from under the main domain, or pipe the information through another API on the same origin.
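One guess, and it is only a guess, is that the CORS complaint is a side effect of the 500 rather than a separate bug: if the error path skips whatever normally adds the Access-Control-Allow-Origin header, the browser will block the response even though the real problem is on the server. A rough Java sketch of computing that header so it can be attached to every response, errors included, might look like the following. The class name, the method name, and the example.com origins are all hypothetical; I have no idea what HBONow's actual allowed origins are.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class CorsHeaderSketch {

    // Hypothetical helper: returns the CORS headers to attach to a response,
    // or an empty map if the requesting origin is not on the allowlist.
    static Map<String, String> corsHeaders(String requestOrigin, Set<String> allowedOrigins) {
        Map<String, String> headers = new LinkedHashMap<>();
        if (requestOrigin != null && allowedOrigins.contains(requestOrigin)) {
            // Echo the specific origin back rather than using a wildcard.
            headers.put("Access-Control-Allow-Origin", requestOrigin);
            // Tell caches that the response differs per Origin header.
            headers.put("Vary", "Origin");
        }
        return headers;
    }

    public static void main(String[] args) {
        Set<String> allowed = Set.of("https://app.example.com"); // assumed allowlist
        System.out.println(corsHeaders("https://app.example.com", allowed));
        System.out.println(corsHeaders("https://other.example.net", allowed));
    }
}
```

The point of keeping this in one small function is that the success path and the error path can both call it, so a 500 at least reaches the page's JavaScript with a readable body instead of being swallowed by the browser.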
Underneath that section, another interesting error popped up:
Error: Entitlements could not be determined
...rt&&(b.entitlementCheck(),a.resolve()),new Error("Entitlements could not be dete...
site-desktop.js (line 10, col 4379)
It feels like what’s going on is a new level of security in the code. Whatever changed between now and last Sunday went untested and has resulted in this fubar. My suspicion is that the executives wanted to lock down this episode pretty hard since there have been so many spoilers about. Unfortunately, whoever made that call put too much stress on the tech team that had to deal with the situation, which led to untested code. If they did have testing, then they did a piss poor job of it, because this is inexcusable.
In truth, my gut feeling is that paranoia about spoilers has made the situation far worse than it needed to be. I get that content creators want a certain amount of respect for themselves and for those whose work they derive from (in this case, George RR Martin). However, the situation has become utterly laughable. I mean, why cause such a PR disaster over something as silly as spoilers? Sure, the Hold-The-Door situation was messed up, but why go crazy over it? If it happens, it happens. No one died. This is just pure ego at this point, and it makes everyone look unprofessional and selfish.
I do feel quite bad for the developers. Whoever had production duty tonight must absolutely hate their life right now.