1. #1

    Hey look, we're back online!

    Oh well, that sucked. About 48 hours ago, MMO-Champion (and the entire Curse Network) went down without warning, and the downtime lasted a ... little longer than expected.

    The site is currently in read-only mode
    We're currently operating in read-only mode on an 8-day-old database. This was the best way to bring the site back online tonight without interrupting the efforts to restore/fix everything. You will not be able to post on the forums, and anything done in the 5 days before the crash will not be visible for the moment.

    If everything goes as planned, tomorrow we will be able to restore everything to the state it was in before the crash and the site will be reopened. (It might happen earlier, but let's say tomorrow.)

    What happened?
    Basically, everything went down when the storage area network decided to fail; there was no real way to prevent that, and it instantly killed all the sites on the network. It's ok, we just used the backup controller that we had. I mean, we totally had that covered!

    Then the backup controller failed. We worked with HP to get replacement parts as soon as possible and had everything replaced within a couple of hours. The problem is, we still had a NAS that had crashed pretty badly on our hands, and we had to check for corrupted data ... for a very long time. The process literally took over 30 hours, and all efforts made during this time to temporarily restore the site failed. We also had consultants on site and worked closely with HP and Microsoft to figure out whether there was a way to speed things up without risking data loss, without much success.

    The NAS eventually came back online very early in the morning, and a couple of sites were brought back online (including the blue tracker a little later).

    Then we got another hilarious problem: the faulty hardware crashed so badly that it bugged the firmware (yes, seriously), and our only hope at this point was to get a bugfix from Hewlett Packard. We're currently working with them and everything should be sorted out soon, but it will take a couple of extra hours. That's why we decided to bring the site back in read-only mode for the moment.

    After the fix is applied, we will be able to restore the database to its pre-crash state and open the site to the public once again with the remaining news and forum posts. (You know, assuming everything goes as planned)

    So, not a hack?
    Nope. I know people freaked out a little, but it really was a pure hardware failure; nothing got compromised.

    But you could have avoided it!
    With another setup, probably, but realistically it's pretty hard to predict every technical failure you could have. And damn, what are the chances that both controllers will fail within 2 hours of each other? For the record, the NAS was part of the brand new hardware bought a year ago to upgrade the entire infrastructure. Of course, lessons will be learned and I'm sure the lovely techy guys will work on ways to prevent that in the future. It would be pretty unfair to bash them at this time, because they pretty much worked 25 hours a day to fix the problem.

    I like you guys.
    I would like to apologize for the whole situation. Realistically, a whole site exploding in the face of users for 2 days isn't something that should happen in a perfect world, but hey, shit does happen. The community has been incredibly supportive throughout the whole thing and I was amazed by all the nice messages we got on Facebook. Really, I expected you guys to eat me alive and call me nasty names, but you didn't. Thank you.

    I hope we're still friends.

    Boubouille

    PS: For the latest news posts, see the news below. If you have any questions, poke me on Twitter (@Boubouille_MMO) and I will try to reply to everyone.

  2. #2
    Stood in the Fire | 10+ Year Old Account | Join Date: Jan 2011 | Location: right here | Posts: 394
    am I the only one that read "Good news everyone!" in Professor Farnsworth's voice?

  3. #3
    Deleted
    Quote Originally Posted by mrbadxampl
    am I the only one that read "Good news everyone!" in Professor Farnsworth's voice?
    <--
    Also.

    10char

  4. #4
    Bloodsail Admiral Bikni | 10+ Year Old Account | Join Date: Oct 2009 | Location: In a galaxy far far away | Posts: 1,192
    Welcome back

  5. #5
    Quote Originally Posted by mrbadxampl
    am I the only one that read "Good news everyone!" in Professor Farnsworth's voice?
    http://img3.visualizeus.com/thumbs/e...70cb629b_h.jpg

  6. #6
    HP gear really blows goats. The only reason it's so popular with sysadmins is that their tech support people are really responsive and supportive. Mind you, I guess they have to plan to fail when their hardware sucks so badly. You could get some decent IBM kit and it would only fail one-tenth of the time the HP kit does. Unfortunately when it does go (and it always will eventually), their support network is so slow and unfriendly in comparison to HP, you probably end up having about the same amount of downtime overall as you did with HP. Swings and roundabouts...

  7. #7
    lulzsec hacked us!

  8. #8
    FYI, a Storage Area Network (SAN) and Network Attached Storage (NAS) are two very different technologies, and the terms are not interchangeable. Other than that, I am glad the site is back up. As a consulting engineer for many organizations, I can say that I have felt this sort of pain before.

  9. #9
    I am Murloc! -Zait- | 10+ Year Old Account | Join Date: Feb 2011 | Location: ♫ ♪ d(Θ.Θ)b ♪ ♫ | Posts: 5,490
    I missed you too, MMO-Champion forums

  10. #10
    High Overlord Pekoe | 10+ Year Old Account | Join Date: Jun 2009 | Location: Massachusetts | Posts: 114
    Quote Originally Posted by mrbadxampl
    am I the only one that read "Good news everyone!" in Professor Farnsworth's voice?
    I was actually having ICC heroic mode flashbacks...
    I really loved when we finally killed that guy, he finally STFU and gave us purple stuff.

    And Boub... Next time, just tell them: "It's not my fault that you suck!" when people complain.

    <3
    -
    Standing in fire since December 2004.

  11. #11
    Keyboard Turner | 10+ Year Old Account | Join Date: May 2011 | Location: Melbourne, Australia | Posts: 6
    Get some NetApp filers please

  12. #12
    "there was no real way to prevent that"
    This is exactly why you should use active/active SAN controllers with dual-port SAS drives. If you do this, there will not be a single point of failure. This IS the real way to prevent that exact scenario.

    To all the people singling out the different vendors: no vendor can prevent an admin/engineer from poorly planning/configuring a system. Additionally, most of the technology for the underlying parts is actually generic. HP doesn't make Fibre Channel cards; they pay Foxconn to make a card using a QLogic controller (the actual chip) and put an HP sticker/BIOS logo on it (this is just an example).

  13. #13
    Quote Originally Posted by mrbadxampl
    am I the only one that read "Good news everyone!" in Professor Farnsworth's voice?
    I was thinking Professor Putricide! lol "Good news, everyone, the slime is back!"

  14. #14
    Quote Originally Posted by Celfydd
    HP gear really blows goats. The only reason it's so popular with sysadmins is that their tech support people are really responsive and supportive. Mind you, I guess they have to plan to fail when their hardware sucks so badly. You could get some decent IBM kit and it would only fail one-tenth of the time the HP kit does. Unfortunately when it does go (and it always will eventually), their support network is so slow and unfriendly in comparison to HP, you probably end up having about the same amount of downtime overall as you did with HP. Swings and roundabouts...
    HP has really gone downhill on quality control... not that they were all that good to begin with.

  15. #15
    We use several Dell EqualLogic SANs and honestly, for one to fail, it normally means that someone didn't address a problem several weeks ago, and then a second failure occurred.

  16. #16
    missed you guys online!
