An Engineering Update on the Dragonflight Launch
Originally Posted by Blizzard (Blue Tracker / Official Forums)
With Dragonflight’s recent launch behind us, we want to take some time to talk with you more about what occurred these past few days from an engineering viewpoint. We hope that this will provide a bit more insight on what it takes to make a global launch like this happen, what can go right, what hiccups can occur along the way, and how we manage them.

Internally, we call events like last Monday “content launch,” because launching an expansion is a process, not one day. Far from being a static game running the same way it did eighteen years ago—or even two years ago—World of Warcraft is in constant change and growth, and our deployment processes change as well.

Expansions now consist of several smaller launches: the code first goes live running the old content, then pre-launch events and new systems turn on, and finally, on content launch day, new areas, quests, and dungeons. Each stage changes different things so we can find and fix problems. But in any large, complex system, the unexpected can still occur.

One change with this expansion was that the content launch was triggered using a timed event: multiple changes to the game can be scheduled to all take effect at a particular time. Making these changes manually carries the risk of human error, or of an internal or external tool outage; using a timed event helps mitigate these risks.
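
To illustrate the idea, here is a minimal sketch of a timed event that applies a batch of changes at one shared instant. The `TimedEvent` class, its `tick` method, and the example changes are hypothetical illustrations, not WoW's actual internals.

```python
import datetime

class TimedEvent:
    """Fires a batch of game-state changes at one scheduled moment.

    Hypothetical sketch: the class name and the change callables are
    illustrative, not Blizzard's actual tooling.
    """

    def __init__(self, fire_at, changes):
        self.fire_at = fire_at   # UTC datetime when everything flips
        self.changes = changes   # callables to run, e.g. unlock a zone
        self.fired = False

    def tick(self, now):
        """Called by the server loop; applies all changes exactly once."""
        if not self.fired and now >= self.fire_at:
            for change in self.changes:
                change()
            self.fired = True

# Usage: schedule three launch-day changes for one shared instant
# (3:00 p.m. PST expressed as 23:00 UTC).
log = []
event = TimedEvent(
    datetime.datetime(2022, 11, 28, 23, 0),
    [lambda: log.append("open Dragon Isles"),
     lambda: log.append("enable boats"),
     lambda: log.append("decrypt launch records")],
)
event.tick(datetime.datetime(2022, 11, 28, 22, 59))  # too early: nothing fires
event.tick(datetime.datetime(2022, 11, 28, 23, 0))   # all changes fire together
```

Because every change hangs off one trigger, nobody has to flip switches by hand at 3:00 p.m., which is the human-error risk the timed event removes.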

Another change in Dragonflight: greatly enhanced support for encrypting game data records. Encrypted records allow us to send out our client with the data that the game needs to show cutscenes, share voice lines, or unlock quests, but keep that data from being mined before players get to experience them in-game. We know the community loves WoW, and when you’re hungry to experience any morsel, it’s hard to not spoil yourself before the main course. Encrypted records allow us to take critical story beats and hide them from players until the right time to reveal them.
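
The shape of that scheme can be sketched as follows. This is a toy model: the `RecordStore` class is hypothetical, and the repeating-key XOR stands in for a real cipher purely to keep the example short; an actual client would use an authenticated encryption scheme.

```python
def xor_crypt(data: bytes, key: bytes) -> bytes:
    # Toy symmetric "cipher" for illustration only; XOR-ing twice with
    # the same key returns the original bytes.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

class RecordStore:
    """Holds game records; encrypted ones stay unreadable until a key arrives."""

    def __init__(self):
        self.records = {}    # record_id -> readable plaintext bytes
        self.encrypted = {}  # record_id -> ciphertext shipped with the client

    def add_encrypted(self, record_id, ciphertext):
        self.encrypted[record_id] = ciphertext

    def unlock(self, record_id, key):
        """At launch time the server delivers the key; the record decrypts."""
        self.records[record_id] = xor_crypt(self.encrypted.pop(record_id), key)

# The client ships with ciphertext only, so dataminers see noise.
key = b"launch-key"
store = RecordStore()
store.add_encrypted("cutscene_finale", xor_crypt(b"Alexstrasza speaks", key))
# ...at content launch, the key arrives and the record becomes readable:
store.unlock("cutscene_finale", key)
```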

We now know that the lag and instability we saw last week were caused by the way these two systems interacted: together, they forced the simulation server (which moves your characters around the world and performs their spells and abilities) to recalculate which records should be hidden more than one hundred times a second, per simulation. Because a great deal of CPU power was spent on these calculations, the simulations became bogged down, and requests from other services to those simulation servers backed up. Players experience this as lag and error messages like “World Server Down.”

As we discovered, records encrypted until a timed event unlocked them exposed a small logic error in the code: a misplaced line of code signaled to the server that it needed to recalculate which records to hide, even though nothing had changed.
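
A toy reconstruction of that class of bug: a “dirty” flag that should only be set when something changes is instead set on every tick, so the expensive recalculation runs at the full tick rate. The `VisibilityCache` class and both `tick` variants are illustrative, not the actual server code.

```python
class VisibilityCache:
    """Caches which encrypted records each player may currently see."""

    def __init__(self):
        self.recalcs = 0
        self.dirty = True  # must recalculate once at startup

    def tick_buggy(self, event_fired: bool):
        # BUG: the dirty flag is set unconditionally (the "misplaced line"),
        # so the expensive recalculation runs every tick (~100x per second),
        # even though event_fired is almost always False.
        self.dirty = True
        if self.dirty:
            self._recalculate()

    def tick_fixed(self, event_fired: bool):
        # FIX: only mark dirty when a timed event actually changed something.
        if event_fired:
            self.dirty = True
        if self.dirty:
            self._recalculate()

    def _recalculate(self):
        self.recalcs += 1   # stands in for the CPU-heavy visibility pass
        self.dirty = False

buggy, fixed = VisibilityCache(), VisibilityCache()
for tick in range(100):  # one simulated second at 100 ticks/sec
    buggy.tick_buggy(event_fired=(tick == 0))
    fixed.tick_fixed(event_fired=(tick == 0))
# buggy recalculates on all 100 ticks; fixed recalculates exactly once
```

The fix is one line of control flow, but the cost difference is a factor of a hundred per simulated second, which is why the symptom only became visible under launch-day load.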

Here’s some insight on how that investigation occurred. First, the clock struck 3:00 p.m. PST. We know from testing that the Horde boat arrives first, and the Alliance boat arrives next. Many of us are logged in to the game on our characters sitting on the docks in both locations in one computer window, watching logs or graphs or dashboards in other windows. We’re also on a conference call with colleagues from our support teams from all over Blizzard.

Before launch, we’ve created contingency plans for situations we’re worried about as a result of our testing. For example, for this launch, our designers created portals that players could use to get to the Dragon Isles in case the boats failed to work.

At 3:02 p.m. the Horde boat arrives on schedule. Hooray! Players pile on, including some Blizzard employees. Other employees wait (they want to be test cases in case we must turn on portals). The players on the boats sail off, and while some do arrive on the Dragon Isles, many more are disconnected or get stuck.

Immediately we start searching logs and dashboards. There are some players on the Dragon Isles map, but not many. Colleagues having issues report their character names and realms as specific examples. Others start reporting spikes in CPU load and on our NFS (Network File Storage) that our servers use. Still others are watching in-game, reporting what they see.

Now that we’ve seen the Horde boats, we start watching for the Alliance boats to arrive. Most of them don’t, and most of the Horde boats do not return.

A picture emerges: the boats are stuck, and Dragon Isles servers are taking much longer to spin up than expected. Here’s where we really dig in and start to problem solve.

Boats have been a problem in the past, so we turn on portals while we continue investigating. Our NFS is clearly overloaded. There’s a large network queue on the service responsible for coordinating the simulation servers, making it think simulations aren’t starting, so it launches more and starts to overwhelm our hardware. Soon we discover that adding the portals has made the overload worse, because players can click the portals as many times as they want, so we turn the portals off.
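
That coordinator feedback loop can be modeled with a few lines. This is a simplified, hypothetical model of the failure mode, not the real service: acks cross a congested queue with some delay, and a naive coordinator that hears nothing within its timeout assumes the launch failed and spawns another simulation.

```python
def launches_needed(ack_delay: int, timeout: int, max_ticks: int) -> int:
    """Count how many simulations a naive coordinator spawns before the
    first 'started' ack gets through.

    ack_delay: ticks an ack spends stuck in the network queue
    timeout:   ticks of silence before the coordinator relaunches
    """
    launched, waited = 1, 0
    for _ in range(max_ticks):
        if waited >= ack_delay:
            return launched          # first ack finally arrives
        if waited and waited % timeout == 0:
            launched += 1            # assumes failure, spawns another sim
        waited += 1
    return launched

# Healthy network: acks in 2 ticks, timeout of 5 -> one launch suffices.
# Congested queue: acks in 30 ticks, timeout of 5 -> the coordinator
# piles on extra simulations, worsening the very congestion it sees.
```

The key property is the self-reinforcing loop: delayed acks cause extra launches, extra launches add load, and added load delays acks further, which matches the hardware-overwhelming behavior described above.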

As the problems persist, we work on tackling the increased load to get as many players in to play as possible, but the service is not behaving as it did in pre-launch tests. We continue to problem-solve, ruling out things we know aren’t the issue based on those tests.

Despite the late hour, many continue to work while others leave to get some rest so they can return early the following day with fresh eyes and relieve those working overnight.

By Tuesday morning, we have a better understanding of things. We know we’re sending clients more messages about quests than usual, although later discoveries will reveal this isn’t causing problems. A new file storage API we’re using is hitting our file storage harder than usual. Some new code added for quest givers to beckon players seems slower than it should be. The service is taking a very long time to send clients all the data changes made in hotfixes. And reports are coming in that players who have made it to the Dragon Isles have started experiencing extreme lag.

Mid-Tuesday morning, a coincidence happens: digging deep into the new beckon code, we find hooks into the new encryption system. We start looking at the question from the other side: could the encryption system being slow explain these and other issues we’re seeing? As it turns out, yes it can. The encryption system being slow explains the hotfix problem, the file storage problem, and the lag players are experiencing. With the source identified, the author of the relevant part of the system was able to pinpoint the error and make the needed correction.

Pushing a fix to code used across so many services isn’t like flipping a switch; new binaries must be pushed out and turned on. We must slowly move players from the old simulations to new ones for the correction to be picked up. In fact, at one point we try to move players too quickly and cause another part of the service to suffer. Some of the affected binaries cannot be corrected without a service restart, which we delay until the fewest players are online so as not to disrupt those in the game. By Wednesday, the fix was fully deployed and service stability improved dramatically.
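
The throttled-migration trade-off can be sketched in a few lines. This is a hypothetical model, not the real rollout tooling: moving players in batches larger than downstream services can absorb per tick finishes sooner but overloads the system, which is the “moved players too quickly” failure described above.

```python
def migrate(total: int, batch_size: int, capacity: int):
    """Move `total` players to new simulations in batches.

    Returns (ticks_taken, overloaded): overloaded is True if any batch
    exceeded what downstream services can absorb in one tick.
    """
    moved, ticks, overloaded = 0, 0, False
    while moved < total:
        batch = min(batch_size, total - moved)
        if batch > capacity:
            overloaded = True     # downstream services start to suffer
        moved += batch
        ticks += 1
    return ticks, overloaded

# Gentle rollout: 1,000 players in batches of 50 against a per-tick
# capacity of 100 takes longer but never overloads. Batches of 500
# finish in two ticks but blow past capacity.
```

The practical consequence is the one the post describes: the safe batch size is set by the weakest downstream service, so a fix that is already written still takes hours to fully take effect.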

While it took some effort to identify the issue and get it fixed, our team was incredibly vigilant in investigating the issue and getting it corrected as quickly as possible. Good software engineering isn’t about never making mistakes; it’s about minimizing the chances of making them, finding them quickly when they happen, having the tools to get in the fixes right away…

…and having an amazing team to come together to make it all happen.



—The World of Warcraft Engineering Team
This article was originally published in forum thread: An Engineering Update on the Dragonflight Launch, started by Lumy.
Comments (107)
  1. Relapses's Avatar
    Quote Originally Posted by SpaghettiMonk View Post
    They said it was a logic error, not an unexpected interaction related to scaling.
    Scaling obviously had something to do with it because the issue was there yet wasn't detected prior to the launch.
  1. Slowpoke is a Gamer's Avatar
    Quote Originally Posted by SpaghettiMonk View Post
    They said it was a logic error, not an unexpected interaction related to scaling.
    Logic error with 10000 people might push a CPU to 50% and everything works as intended.

    Logic error with 1000000 people pushes that CPU to 100% and then the game melts down.

    It can be both a logic error and something that ends up getting worse the more people are in a location to trigger the error.
  1. SpaghettiMonk's Avatar
    Quote Originally Posted by Slowpoke is a Gamer View Post
    Logic error with 10000 people might push a CPU to 50% and everything works as intended.

    Logic error with 1000000 people pushes that CPU to 100% and then the game melts down.

    It can be both a logic error and something that ends up getting worse the more people are in a location to trigger the error.
    Sure - I'm just saying I wish they had said that. There are other potential explanations as well, like something to do with the differences between live and beta servers. "Why didn't we catch this during testing" is a very important question that should be asked and answered thoroughly every time there's a bug like this. I'm not naive enough to say that they definitely should have caught it, but the omission of that particular detail feels like a gap to me.
  1. Schmeebs's Avatar
    Quote Originally Posted by SpaghettiMonk View Post
    They said it was a logic error, not an unexpected interaction related to scaling.
    You are an idiot
  1. iperson's Avatar
    best buy geek squad chiming in on these threads is always hilarious. "I installed Linux once, here's how blizzard can fix wow"
  1. bmjclark's Avatar
    Interesting post to read through tbh. I'd love if they'd do these kinds of posts more often (although hopefully not related to server difficulties next time). It's interesting to hear some of the problems they run into. A day or 2 of broken ass servers sucks but it could be a whole lot worse (WoD) so whatever.
  1. Feeline10's Avatar
    Quote Originally Posted by iperson View Post
    best buy geek squad chiming in on these threads is always hilarious. "I installed Linux once, here's how blizzard can fix wow"
Re-read the thread and it should be readily evident who is who and what's up.
  1. Araitik's Avatar
    So many people here have 0 idea what it's like to push software in production, or even write software. And I'm not even talking about the scale Blizzard is working at.

    Release software to even ONE customer, then come back, and then we'll talk about who's an "incompetent dev that doesn't test anything"
  1. Gorsameth's Avatar
    Quote Originally Posted by SpaghettiMonk View Post
    Sure - I'm just saying I wish they had said that. There are other potential explanations as well, like something to do with the differences between live and beta servers. "Why didn't we catch this during testing" is a very important question that should be asked and answered thoroughly every time there's a bug like this. I'm not naive enough to say that they definitely should have caught it, but the omission of that particular detail feels like a gap to me.
    But they did say it.
    The result was: they forced the simulation server (that moves your characters around the world and performs their spells and abilities) to recalculate which records should be hidden more than one hundred times a second, per simulation. As a great deal of CPU power was spent doing these calculations, the simulations became bogged down, and requests from other services to those simulation servers backed up.
    Without enough people, the server can handle the hundreds of calls a second; you might see more load than expected if you're looking for it, but the server itself will perform fine.

    And because the simulation servers could handle the load from the beta the second cascade didn't happen.
    There’s a large network queue on the service responsible for coordinating the simulation servers, making it think simulations aren’t starting, so it launches more and starts to overwhelm our hardware. Soon we discover that adding the portals has made the overload worse, because players can click the portals as many times as they want, so we turn the portals off.
  1. Twdft's Avatar
    This really made me laugh:

    With the source identified, the author of the relevant part of the system was able to identify the error and make the needed correction.
    So there is one guy who understands the code needed. Others did manage to identify the code that had to be fixed, but only that one dude could actually fix it. I guess no documentation available. Who could've seen that coming? Well actually everyone who was involved in software being developed. I think you'll have to 24/7 hold a pistol to the head of a developer to document their code correctly, it's not happening on its own.
  1. Gorsameth's Avatar
    Quote Originally Posted by Twdft View Post
    This really made me laugh:



    So there is one guy who understands the code needed. Others did manage to identify the code that had to be fixed, but only that one dude could actually fix it. I guess no documentation available. Who could've seen that coming? Well actually everyone who was involved in software being developed. I think you'll have to 24/7 hold a pistol to the head of a developer to document their code correctly, it's not happening on its own.
    Why would they need to get someone else to read the documentation and go through the code finding the bit that needs to be changed when they have the guy who wrote it right there who knows what bit he is looking for?

    You're seriously reading too much into this, looking for stuff that lets you shout "omg Blizzard bad".
  1. Twdft's Avatar
    Quote Originally Posted by Gorsameth View Post
    Why would they need to get someone else to read the documentation and go through the code finding the bit that needs to be changed when they have the guy who wrote it right there who knows what bit he is looking for?

    You're seriously reading too much into this, looking for stuff that lets you shout "omg Blizzard bad".
    With all other fixes it was just "we fixed it". They specifically mentioned the author fixing that one thing.

    Also, I did very much not say "Blizzard bad". Poor documentation happens everywhere where software is developed. I don't think you'll find a single piece of well documented software on earth if the author did not have a pistol on their head 24/7
  1. Relapses's Avatar
    Quote Originally Posted by Twdft View Post
    With all other fixes it was just "we fixed it". They specifically mentioned the author fixing that one thing.

    Also, I did very much not say "Blizzard bad". Poor documentation happens everywhere where software is developed. I don't think you'll find a single piece of well documented software on earth if the author did not have a pistol on their head 24/7
    It's hard to say whether the author was specifically chosen because of the "poor documentation" you're implying or because that's simply how Blizzard operates. Either way it's fairly immaterial, though I'm sure Blizzard learned lessons from this.
  1. bloodykiller86's Avatar
    Quote Originally Posted by Biomega View Post
    I got bad news for you lol
    .....you realize that is a normal part of acquisitions right? lol

    - - - Updated - - -

    Quote Originally Posted by Hablion View Post
    Well the FTC has decided to Sue to Block the Merger so it is quite likely that they might not get ABK after all.
    they have no real case to block it anyways so theyre just doing that to get concessions as they need to file a suit to do so.

    - - - Updated - - -

    Quote Originally Posted by ablib View Post
    The EU and now the US are trying to block it. Also, whatever you said, has nothing to do with the problem.

    https://www.cnn.com/2022/12/08/tech/...ion/index.html
    and neither will win they will just get concessions because Sony is in with these politicians

    - - - Updated - - -

    Quote Originally Posted by dwarven View Post
    That would not have helped anything in this case. As a developer and cloud user myself, bad code is bad code no matter where it runs. Look at New World, which runs on AWS.
    did i say at all that it would help with bad code? no lol im just saying azure servers will be way better than anything else
  1. Yakut's Avatar
    Quote Originally Posted by bloodykiller86 View Post
    did i say at all that it would help with bad code? no lol im just saying azure servers will be way better than anything else
    It is better for you to keep silent and be suspected of being an idiot than to open your mouth and remove all doubt. Unfortunately, you didn't have self control and now everyone with a teaspoon of knowledge knows exactly how little you do have.

    Azure isn't better (or worse). Same with GCP and AWS. None of them would have solved this problem. But your armchair analysis reveals you have no clue about what happened, what it took to resolve, and what would have prevented it.

    I congratulate you on the efficiency of your post with its ability to expose your ignorance. Bravo. I doubt I could have done better.
  1. Duende's Avatar
    Blizzard did more harm than good by releasing this info.
  1. DatToffer's Avatar
    Quote Originally Posted by Lumy View Post
    …and having an amazing team to come together to make it all happen.
    I sure hope this team got paid the extra hours to be amazing.
  1. Yakut's Avatar
    Quote Originally Posted by Duende View Post
    Blizzard did more harm than good by releasing this info.
    I disagree. It shows they're learning, albeit slowly, that transparency and open communication are signs of a healthy company. There will always be ignorant people who misunderstood what happened even when explained clearly (as this page of this thread has done before your post). But even that's not Blizzard's fault.

    They need to have this exposed for a lot of reasons. But first and foremost so customers understand the nature of the problem. Secondly, Blizzard, despite the analysis of some in this thread, is also acknowledging they screwed up not that they were proud to fix a problem they didn't know they created. And, lastly, perhaps most importantly, owning up to a problem in a public forum is not easy as a person let alone a company.

    I find it interesting that people complain they want honest and transparent communication, only to then blast the company for doing exactly that. Can they ever do anything right if they listen to the customers in this scenario? I think that's one Kobayashi Maru scenario I'd rather not be in.
  1. bloodykiller86's Avatar
    Quote Originally Posted by Yakut View Post
    It is better for you to keep silent and be suspected of being an idiot than to open your mouth and remove all doubt. Unfortunately, you didn't have self control and now everyone with a teaspoon of knowledge knows exactly how little you do have.

    Azure isn't better (or worse). Same with GCP and AWS. None of them would have solved this problem. But your armchair analysis reveals you have no clue about what happened, what it took to resolve, and what would have prevented it.

    I congratulate you on the efficiency of your post with its ability to expose your ignorance. Bravo. I doubt I could have done better.
    jesus christ I DIDNT SAY it would help with this issue omfg! lol im just saying IN GENERAL Azure servers would be better in the long run than what they use you fool. Im in IT of course i know that issue they explained wouldnt make a difference on Azure servers it would just have more capacity to run until it killed itself if anything at all lol
  1. dwarven's Avatar
    Quote Originally Posted by bloodykiller86 View Post
    did i say at all that it would help with bad code? no lol im just saying azure servers will be way better than anything else
    Not necessarily, no. They'd be unlikely to migrate to Azure anyway, since that would remove a lot of capacity from that part of Microsoft's business. Blizzard has already built the infrastructure to run WoW, so they'll 99.9% just keep using that. Additionally, a cloud migration of that scale would cost a lot of money, take a year or longer, and could break a lot of things that are already working just fine as is. Sorry, but you have no idea what you're talking about.
