> Explanation on Long Downtime (2012-1-29), Please read if you are interested
E86
  Posted: Jan 29 2012, 11:37 PM


Lead Administrator

Group: ADMINISTRATOR
Posts: 587
Member No.: 2
Joined: Dec 22nd 2006
Location: San Leandro, CA





As many of you have noticed, the forum was down for almost all of Sunday. It first went down at 3:30am Pacific Time on January 29th, 2012. I did not know until I woke up at 8:00am. I quickly sent several notices to the Facebook group to spread awareness of the problem. I then submitted a ticket to our host's tech support. They responded with the following within 15 minutes:

QUOTE (Reply from DreamHost (Jan 29 @ 2012 - 09:13:20 / #5393xxxx))
Subject: Re: Site Down
First, I'd like to apologize for the use of a canned response to this support issue. We have identified and are actively working on a fix for the problems you are experiencing with FTP and Web services. This issue is affecting a large subset of our customers and as such our system administration, data center operations, and development teams are all working on resolving these issues as quickly as possible.

We are making this information available, along with updates, on the http://www.dreamhoststatus.com/ website. Please follow this url to keep up to date with any future updates.

http://www.dreamhoststatus.com/2012/01/29/...s-for-a-subset-of-vps-shared-and-dedicated-machines-down/

Again, we do apologize for the lack of a personal response to this support request and if you have any further questions please let us know.

Thank you for your understanding in this matter,
JJ Galvez
DreamHost Technical Support


Basically, they were aware of the issue. As the day progressed, many customers like me grew angry at the long downtime. Finally, by 11:00pm Pacific Time, the site was back up. The entire downtime lasted 19.5 hours (3:30am to 11:00pm). Here is a message from the CEO of the host:

QUOTE
Update Jan 29th, 9:40pm PST:

From Simon Anderson, CEO, DreamHost: My sincere apologies for the downtime experienced today by many of our dedicated and VPS customers, plus some shared customers. I know that this has been a poor customer experience for you. Almost all services are back up after an intense effort from the DreamHost dev, admin, data center and support teams. I was involved in the coordination of our efforts today and now am able to share what happened, and what we’re going to do to reduce the risk that it happens again.

We run Debian OS and have used autoupdates to ensure security packages are installed as soon as they are available. We’ve had some breakage in the past from this approach, but nothing major. However last night’s autoupdate went badly wrong, removing essential packages from dedicated, VPS and some shared servers. Our monitoring and support team flagged the issue fast, and we scrambled our admin, dev and NOC teams to reinstall the packages that had been removed by autoupdate, reboot servers, fix package dependencies, and test that individual services were live. Given the number of services affected, this took a long time to complete. Rest assured we had all hands working on the issue, but I know it was still a frustrating experience for customers.

To mitigate the risk of anything like this happening again, we’re immediately switching off autoupdates, and moving to a manual process where we’ll only push out Debian updates after significant testing. There’s always a balance to be struck between speed, efficiency, security and issue prevention, but this event has shown us that we need to take a different approach. Again, my apologies for the downtime experienced today. We’re acutely focused on adjusting our processes and systems to ensure we do a better job going forward. – Simon


In short, the downtime was caused by a botched automatic update that removed essential packages from the servers, which took the service down at around 3:30am. I am currently trying to get some sort of compensation for such a long downtime. If they refuse, I might be forced to switch to another host. This sort of downtime is simply unacceptable.
+-!mma_N00B-+
Posted: Jan 30 2012, 05:27 AM


Stop, time!
**********

Group: Core Members
Posts: 5,287
Member No.: 7,398
Joined: Oct 15th 2010
Location: Cebu, Philippines





Call me soft, but if it were me, I'd accept his apology and see whether the server keeps working well.
geokilla
Posted: Jan 30 2012, 01:47 PM


Bored
**********

Group: Advanced Members
Posts: 3,109
Member No.: 2,683
Joined: Apr 24th 2008
Location: North York, Ontario, Canada





QUOTE (+-!mma_N00B-+ @ Today, 8:27 AM)
Call me soft, but if it were me, I'd accept his apology and see whether the server keeps working well.

That, and WME is less active now. No need to go through all that trouble for us, Perry, but at least get a day's worth of compensation.
Geo
Posted: Jan 31 2012, 04:19 AM


all bow before me

Group: FORUM MODERATOR
Posts: 33,043
Member No.: 1,760
Joined: Dec 25th 2007
Location: Sydney, Australia





You guys are sort of misunderstanding the problem at hand. We moved to our current server because the previous one was fairly unreliable, and the quality of service the current one provided had been good until the recent downtime.

At the end of the day, it's up to E86 to choose what to do with the forum and where to take it.