An Engineering Update on the Dragonflight Launch
Originally Posted by Blizzard (Blue Tracker / Official Forums)
With Dragonflight’s recent launch behind us, we want to take some time to talk with you more about what occurred these past few days from an engineering viewpoint. We hope that this will provide a bit more insight on what it takes to make a global launch like this happen, what can go right, what hiccups can occur along the way, and how we manage them.

Internally, we call events like last Monday “content launch,” because launching an expansion is a process, not one day. Far from being a static game running the same way it did eighteen years ago—or even two years ago—World of Warcraft is in constant change and growth, and our deployment processes change as well.

Expansions now consist of several smaller launches: the code first goes live running the old content, then pre-launch events and new systems turn on, and finally, on content launch day, new areas, quests, and dungeons open. Each stage changes different things so we can find and fix problems. But in any large, complex system, the unexpected can still occur.

One change with this expansion was that the content launch was triggered using a timed event: multiple changes to the game can be scheduled to all happen at a particular time. Making these changes manually carries the risk of human error, or of an internal or external tool outage. Using a timed event helps to mitigate these risks.
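As an illustration only (this is not Blizzard's code), the idea of a timed event can be sketched in a few lines of Python. The class name `TimedEvent`, the `poll` method, and the numeric clock are all assumptions made for the example; the point is simply that several changes are bundled together and applied once, when the clock reaches the scheduled time, rather than being applied by hand.

```python
class TimedEvent:
    """Hypothetical timed event: a bundle of changes applied once at a set time."""

    def __init__(self, fire_at, changes):
        self.fire_at = fire_at        # scheduled time, in arbitrary clock units
        self.changes = list(changes)  # zero-argument callables to run together
        self.fired = False

    def poll(self, now):
        """Apply every change the first time `now` reaches `fire_at`.

        Returns True only on the call that actually fires the event.
        """
        if self.fired or now < self.fire_at:
            return False
        for change in self.changes:
            change()
        self.fired = True
        return True


# Example: two changes that should go live at the same moment.
game_state = {}
launch = TimedEvent(
    fire_at=100,
    changes=[
        lambda: game_state.update(boats_enabled=True),
        lambda: game_state.update(new_quests_enabled=True),
    ],
)
```

Calling `launch.poll(now)` from a server loop leaves `game_state` untouched until `now` reaches 100, then applies both changes in a single step and never again.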

Another change in Dragonflight: greatly enhanced support for encrypting game data records. Encrypted records allow us to send out our client with the data that the game needs to show cutscenes, share voice lines, or unlock quests, but keep that data from being mined before players get to experience it in-game. We know the community loves WoW, and when you’re hungry to experience any morsel, it’s hard to not spoil yourself before the main course. Encrypted records allow us to take critical story beats and hide them from players until the right time to reveal them.

We now know that the lag and instability we saw last week were caused by the way these two systems interacted. Together, they forced the simulation server (which moves your characters around the world and performs their spells and abilities) to recalculate which records should be hidden more than one hundred times a second, per simulation. As a great deal of CPU power was spent doing these calculations, the simulations became bogged down, and requests from other services to those simulation servers backed up. Players see this as lag and error messages like “World Server Down.”

As we discovered, records encrypted until a timed event unlocked them exposed a small logic error in the code: a misplaced line of code signaled to the server that it needed to recalculate which records to hide, even though nothing had changed.
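To make this class of error concrete, here is a minimal Python sketch. It is not the actual server code: the `RecordVisibility` class, its fields, and the 100-ticks-per-second loop are hypothetical stand-ins. With `buggy=True`, a misplaced "mark as stale" line runs on every tick, so the hidden-record set is recomputed every time even though nothing has changed. With `buggy=False`, the set is recomputed only when a timed unlock actually arrives.

```python
class RecordVisibility:
    """Hypothetical cache of which encrypted records are still hidden."""

    def __init__(self, unlock_times, buggy=False):
        self.unlock_times = dict(unlock_times)  # record id -> unlock time (seconds)
        self.buggy = buggy
        self.recalculations = 0
        self._hidden = frozenset()
        self._dirty = True  # force one initial calculation
        self._next_unlock = min(self.unlock_times.values(), default=None)

    def _recalculate(self, now):
        # The expensive step: rebuild the set of records that are still locked.
        self.recalculations += 1
        self._hidden = frozenset(
            rid for rid, t in self.unlock_times.items() if t > now
        )
        pending = [t for t in self.unlock_times.values() if t > now]
        self._next_unlock = min(pending, default=None)
        self._dirty = False

    def tick(self, now):
        if self.buggy:
            # Misplaced line (assumed analogue of the bug): always mark stale.
            self._dirty = True
        elif self._next_unlock is not None and now >= self._next_unlock:
            # Intended behavior: mark stale only when an unlock time is reached.
            self._dirty = True
        if self._dirty:
            self._recalculate(now)
        return self._hidden


def simulate(buggy, seconds=10, ticks_per_second=100):
    """Run a fixed number of ticks and report how many recalculations occurred."""
    vis = RecordVisibility({"cutscene": 5.0, "quest_text": 5.0}, buggy=buggy)
    for i in range(seconds * ticks_per_second):
        vis.tick(i / ticks_per_second)
    return vis.recalculations
```

In this toy setup the correct version recalculates twice (once at startup and once at the unlock), while the buggy version recalculates on every one of the 1,000 ticks. The result is identical either way, which is why an error like this can stay invisible until the wasted work becomes large enough to matter.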

Here’s some insight on how that investigation occurred. First, the clock strikes 3:00 p.m. PST. We know from testing that the Horde boat arrives first, and the Alliance boat arrives next. Many of us are logged in to the game, with our characters sitting on the docks in both locations in one computer window, while we watch logs, graphs, or dashboards in other windows. We’re also on a conference call with colleagues from our support teams from all over Blizzard.

Before launch, we’ve created contingency plans for situations we’re worried about as a result of our testing. For example, for this launch, our designers created portals that players could use to get to the Dragon Isles in case the boats failed to work.

At 3:02 p.m. the Horde boat arrives on schedule. Hooray! Players pile on, including some Blizzard employees. Other employees wait (they want to be test cases in case we must turn on portals). The players on the boats sail off, and while some do arrive on the Dragon Isles, many more are disconnected or get stuck.

Immediately we start searching logs and dashboards. There are some players on the Dragon Isles map, but not many. Colleagues having issues report their character names and realms as specific examples. Others start reporting spikes in CPU load and in load on the NFS (Network File Storage) that our servers use. Still others are watching in-game, reporting what they see.

Now that we’ve seen the Horde boats, we start watching for the Alliance boats to arrive. Most of them don’t, and most of the Horde boats do not return.

A picture emerges: the boats are stuck, and Dragon Isles servers are taking much longer to spin up than expected. Here’s where we really dig in and start to problem solve.

Boats have been a problem in the past, so we turn on portals while we continue investigating. Our NFS is clearly overloaded. There’s a large network queue on the service responsible for coordinating the simulation servers, making it think simulations aren’t starting, so it launches more and starts to overwhelm our hardware. Soon we discover that adding the portals has made the overload worse, because players can click the portals as many times as they want, so we turn the portals off.
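The portal problem above is an instance of a general pattern: if every click creates new work, impatient users multiply the load. One common mitigation, shown here purely as a sketch and not as anything Blizzard describes doing (they turned the portals off), is to treat a request as idempotent per player while it is still pending. The `TeleportQueue` class and its method names are hypothetical.

```python
class TeleportQueue:
    """Hypothetical queue that accepts at most one pending teleport per player."""

    def __init__(self):
        self._pending = {}  # player id -> requested destination
        self.accepted = 0
        self.ignored = 0

    def request(self, player_id, destination):
        """Queue a teleport unless this player already has one pending."""
        if player_id in self._pending:
            self.ignored += 1  # repeated click: no additional work is created
            return False
        self._pending[player_id] = destination
        self.accepted += 1
        return True

    def complete(self, player_id):
        """Mark the player's teleport as done so a later request is accepted."""
        return self._pending.pop(player_id, None)
```

With this in place, five rapid clicks from the same player produce one queued request and four ignored ones; the player can request again only after `complete` has been called for them.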

As the problems persist, we work on tackling the increased load to get as many players in to play as possible, but the service is not acting like it did in pre-launch tests. We continue to problem-solve the issue and discount things we know aren’t the issue based on those tests.

Despite the late hour, many continue to work while others take off to get rest so they can return early the following day to get a fresh start and relieve those who will work overnight.

By Tuesday morning, we have a better understanding of things. We know we’re sending more messages to clients about quests than usual, although later discoveries will reveal this isn’t causing problems. A new file storage API we’re using is hitting our file storage harder than usual. Some new code added for quest givers to beckon players seems slower than it should be. The service is taking a very long time to send clients all the data changes made in hotfixes. Reports are coming in that players who have reached the Dragon Isles have started experiencing extreme lag.

Mid-Tuesday morning a coincidence happens: digging deep into the new beckon code, we find hooks for the new encryption system. We start looking at the question from the other side: could the encryption system being slow explain these and other issues we’re seeing? As it turns out, yes, it can. The encryption system being slow explains the hotfix problem, the file storage problem, and the lag players are experiencing. With the source identified, the author of the relevant part of the system is able to identify the error and make the needed correction.

Pushing a fix to code used across so many services isn’t like flipping a switch: new binaries must be pushed out and turned on. We must slowly move players from the old simulations to new ones for the correction to be picked up. In fact, at one point we try to move players too quickly and cause another part of the service to suffer. Some of the affected binaries cannot be corrected without a service restart, which we delay until the fewest players are online so as not to disrupt players who are in the game. By Wednesday, the fix was completely out and service stability had dramatically improved.
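Moving players gradually rather than all at once can be pictured as a batched migration. The following Python sketch is an assumption-laden illustration, not the real tooling: `migration_batches` and the batch size are hypothetical, and a real rollout would also wait and check service health between batches.

```python
from itertools import islice


def migration_batches(players, batch_size):
    """Yield players in fixed-size batches so only a few move at a time."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(players)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Example: seven players moved at most three at a time.
batches = list(migration_batches(range(7), 3))
```

The caller controls the pace: choosing a batch size that is too large, or skipping the pause between batches, corresponds to the "moving players too quickly" failure mentioned above.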

While it took some effort to identify the issue and get it fixed, our team was incredibly vigilant in investigating the issue and getting it corrected as quickly as possible. Good software engineering isn’t about never making mistakes; it’s about minimizing the chances of making them, finding them quickly when they happen, having the tools to get in the fixes right away…

…and having an amazing team to come together to make it all happen.



—The World of Warcraft Engineering Team
This article was originally published in the forum thread "An Engineering Update on the Dragonflight Launch," started by Lumy.
103 Comments
  1. Nerdslime
    Fascinating post by Blizz. I hope they keep this kind of transparency up; though I am not the type to dogpile them on technical issues, I think clear and interesting explanations like this interrupt the negativity cycles some of the player base can get caught in. Very interesting post in any case!
  1. SniperCT
    Oh this is exactly the kind of explanation I want to see more of.
  1. Relapses
    Originally posted by Nerdslime:
    Fascinating post by Blizz. I hope they keep this kind of transparency up; though I am not the type to dogpile them on technical issues, I think clear and interesting explanations like this interrupt the negativity cycles some of the player base can get caught in. Very interesting post in any case!
    Sadly, some of the replies in this thread prove that some people will just be upset regardless of the justification. The only acceptable outcome for the unreasonable is for there never to be a problem in the first place.
  1. huldu
    If they had spent that much time and effort on actually creating content for Dragonflight, this could actually have been good. It's a huge disappointment. An MMO has to be more than the sum of some of its parts. If that isn't the case, you might as well remove leveling and just have PvP, dungeons (M+), and heroic/mythic raids. Everything else serves no purpose.
  1. Gombadoh
    Cool puzzle they solved.
  1. Celement
    Originally posted by Relapses:
    Yeah, if only they could have figured out a way to get millions of customers to log into the game at once in a test environment...
    It isn't really possible. Most people don't log onto the beta beyond looking around for a bit. Getting thousands of people to log into a beta is a feat. After the CE level guilds you see a massive drop. You could argue they could have tested out the encryption software during a patch launch but you would get much the same issue.
  1. Relapses
    Originally posted by Celement:
    It isn't really possible. Most people don't log onto the beta beyond looking around for a bit. Getting thousands of people to log into a beta is a feat. After the CE level guilds you see a massive drop. You could argue they could have tested out the encryption software during a patch launch but you would get much the same issue.
    That was sarcasm, my bad if it wasn't obvious. I think it's ridiculous that people are saying "THEY SHOULD HAVE KNOWN BETTER" as if they routinely compile billions of pristine, flawless lines of code and have never made a human mistake in their lives.
  1. Celement
    Originally posted by Relapses:
    That was sarcasm, my bad if it wasn't obvious. I think it's ridiculous that people are saying "THEY SHOULD HAVE KNOWN BETTER" as if they routinely compile billions of pristine, flawless lines of code and have never made a human mistake in their lives.
    A better argument would be that they shouldn't have bothered. I don't see the encryption as necessary, and it comes across more as a tool to make private servers harder to spin up than as any benefit to the customer.
  1. Simulacrum
    "We created an unnecessary problem for no reason, but we fixed it and that makes us amazing!"

    jesus christ
  1. DechCJC
    Love this, super insightful read and a nice reminder to us all that they're just humans working at Blizzard.
  1. Feeline10
    Originally posted by pelos:
    sounds like after 18 yrs they still do thing wrong and without testing... blizz doesn't stand for quality anymore.
    Please re-read the thread. I won't even say anything I've said is profound or.. what do the kids say, a "fresh take." However, the poster Yakut - read his posts in this thread.
  1. Stickiler
    Originally posted by Celement:
    A better argument would be that they shouldn't have bothered. I don't see the encryption as necessary, and it comes across more as a tool to make private servers harder to spin up than as any benefit to the customer.
    It has literally nothing to do with private servers; it's about people datamining and spoiling the cutscenes before the game actually launches. When the cutscenes are launched for real (when the game progresses to that point), the encryption on the cutscenes and story content will be removed.
  1. Feeline10
    Originally posted by Simulacrum:
    "We created an unnecessary problem for no reason, but we fixed it and that makes us amazing!"

    jesus christ
    Very upsetting to see these types of posts. They're untrue.. and now that it's corrected, best believe other companies will adopt this encryption method so their own rabid communities cannot devour the content early, like we do with WoW.

    This was once again an unpredictable problem which will always, inevitably happen with IT in general.

    I won't say anyone who took Monday or Tuesday off is foolish, but.. consider the situation next time
  1. Simulacrum
    Originally posted by Feeline10:
    Very upsetting to see these types of posts. They're untrue.. and now that it's corrected, best believe other companies will adopt this encryption method so their own rabid communities cannot devour the content early, like we do with WoW.

    This was once again an unpredictable problem which will always, inevitably happen with IT in general.

    I won't say anyone who took Monday or Tuesday off is foolish, but.. consider the situation next time
    Untrue? They literally say they created the problem then fixed it and then they sycophantically complimented themselves for having fixed a problem that never needed to exist in the first place.

    That's like if I ran up and broke your window, then demanded you pay me to fix it for you, and then I spend weeks fixing it while your house is wide open to the wind, and then I call myself an amazing person for having made you pay me money to fix your window that I broke. It's psychotic.
  1. nulian84
    Originally posted by Simulacrum:
    Untrue? They literally say they created the problem then fixed it and then they sycophantically complimented themselves for having fixed a problem that never needed to exist in the first place.

    That's like if I ran up and broke your window, then demanded you pay me to fix it for you, and then I spend weeks fixing it while your house is wide open to the wind, and then I call myself an amazing person for having made you pay me money to fix your window that I broke. It's psychotic.
    I think that feature needs to exist; far too much content is datamined and then spoiled by Wowhead and content creators well before release.
    So it's good that important story parts are kept hidden until the first player arrives at that location.
  1. Relapses
    Originally posted by SpaghettiMonk:
    I would be curious to know why this didn't appear on the beta - maybe a limitation in the testing environment?
    There weren't enough people. This was one of those things that would only show up if they crammed millions of players into the same place.
  1. Relapses
    Originally posted by SpaghettiMonk:
    I don't understand this - it doesn't sound like it was something caused by mass scaling issues such that you couldn't test it by cramming 10,000 into a smaller server.
    They did test it; as explained, it was an unexpected interaction between two different systems that they couldn't have anticipated. At the level of thousands like you'd have on the test realms, it's fine. But when you spool that up to millions, a minor imperceptible problem reaches its breaking point.
  1. Relapses
    Originally posted by SpaghettiMonk:
    They said it was a logic error, not an unexpected interaction related to scaling.
    Scaling obviously had something to do with it because the issue was there yet wasn't detected prior to the launch.
  1. Slowpoke is a Gamer
    Originally posted by SpaghettiMonk:
    They said it was a logic error, not an unexpected interaction related to scaling.
    Logic error with 10000 people might push a CPU to 50% and everything works as intended.

    Logic error with 1000000 people pushes that CPU to 100% and then the game melts down.

    It can be both a logic error and something that ends up getting worse the more people are in a location to trigger the error.
  1. Schmeebs
    Originally posted by SpaghettiMonk:
    They said it was a logic error, not an unexpected interaction related to scaling.
    You are an idiot
