As many of you have noticed, the forum was down for almost all of Sunday. It first went down at 3:30am Pacific Time on January 29th, 2012. I did not know until I woke up at 8:00am. I quickly posted several notices to the Facebook group at http://facebook.com/groups/wmexpressway to spread awareness of the problem. I then submitted a ticket to our host's tech support. They responded with the following within 15 minutes:
QUOTE (Reply from DreamHost (Jan 29 @ 2012 - 09:13:20 / #5393xxxx))
Subject: Re: Site Down
First, I'd like to apologize for the use of a canned response to this support issue. We have identified and are actively working on a fix for the problems you are experiencing with FTP and Web services. This issue is affecting a large subset of our customers and as such our system administration, data center operations, and development teams are all working on resolving these issues as quickly as possible.
We are making this information available, along with updates, on the http://www.dreamhoststatus.com/ website. Please follow this url to keep up to date with any future updates.
Again, we do apologize for the lack of a personal response to this support request and if you have any further questions please let us know.
Thank you for your understanding in this matter,
DreamHost Technical Support
Basically, they were aware of the issue. As the day went on, many customers like me grew angry at the long downtime. Finally, at 11:00pm Pacific Time, the site came back up. The entire downtime lasted 19.5 hours. Here is a message from the CEO of the host:
Update Jan 29th, 9:40pm PST:
From Simon Anderson, CEO, DreamHost: My sincere apologies for the downtime experienced today by many of our dedicated and VPS customers, plus some shared customers. I know that this has been a poor customer experience for you. Almost all services are back up after an intense effort from the DreamHost dev, admin, data center and support teams. I was involved in the coordination of our efforts today and now am able to share what happened, and what we’re going to do to reduce the risk that it happens again.
We run Debian OS and have used autoupdates to ensure security packages are installed as soon as they are available. We’ve had some breakage in the past from this approach, but nothing major. However last night’s autoupdate went badly wrong, removing essential packages from dedicated, VPS and some shared servers. Our monitoring and support team flagged the issue fast, and we scrambled our admin, dev and NOC teams to reinstall the packages that had been removed by autoupdate, reboot servers, fix package dependencies, and test that individual services were live. Given the number of services affected, this took a long time to complete. Rest assured we had all hands working on the issue, but I know it was still a frustrating experience for customers.
To mitigate the risk of anything like this happening again, we’re immediately switching off autoupdates, and moving to a manual process where we’ll only push out Debian updates after significant testing. There’s always a balance to be struck between speed, efficiency, security and issue prevention, but this event has shown us that we need to take a different approach. Again, my apologies for the downtime experienced today. We’re acutely focused on adjusting our processes and systems to ensure we do a better job going forward. – Simon
In short, the outage was caused by a botched automatic update that removed essential packages from the servers, which took the service down shortly after 3:30am. I am currently in the process of seeking some sort of compensation for such a long downtime. If they refuse, I might be forced to switch to another host. This sort of downtime is simply unacceptable.