Amazon Web Services Hit By Outage

Several high-profile websites and services have been knocked offline by a failure at one of Amazon’s major US data centres.

Amazon Web Services (AWS) allows firms to rent cloud servers in order to host data on the internet without needing to invest in their own infrastructure.

On Tuesday, Quora (a Q&A forum), Trello, Slack, Splitwise, Soundcloud and Medium were among the popular internet services that were impacted.

Amazon said it is “working hard at repairing” the problem.

“We believe we understand root cause,” the company said.

Other services, including Slack, have also lost some key functionality.

Specifically, it was AWS’s S3 – which stands for Simple Storage Service – that was affected, in US-EAST-1.

To varying degrees it serves around 150,000 sites and services around the world, mostly in the US.

AWS is used by some of the web’s most recognisable and powerful names including Netflix, Spotify and Airbnb. While none of those services went offline on Tuesday, users did report performance issues and slowdown.

US government services such as the Securities and Exchange Commission (SEC) were also affected.

Downtime is a critical issue for any cloud service. Amazon competes with Google, Microsoft and others for what is an increasingly lucrative line of business for the web giants.

To help mitigate this, PALSS can load balance your traffic between providers and, when an outage like this occurs, bypass the failed servers. Alternatively, it can simply provide a landing page that directs your clients to other ways of contacting you.
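As a rough illustration of the idea above, here is a minimal Python sketch of round-robin balancing that skips unhealthy providers and falls back to a landing page when none are available. It is not PALSS's implementation: the class name `FailoverBalancer`, the `is_healthy` callback and the example URLs are all hypothetical, and a real setup would run proper health checks rather than consult an in-memory set.

```python
from typing import Callable, Sequence


class FailoverBalancer:
    """Round-robin over backends, skipping any that fail a health check.

    Hypothetical sketch: names and URLs are assumptions for illustration only.
    """

    def __init__(
        self,
        backends: Sequence[str],
        is_healthy: Callable[[str], bool],
        landing_page: str,
    ) -> None:
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = list(backends)
        self._is_healthy = is_healthy
        self._landing_page = landing_page
        self._next = 0  # index of the backend to try first on the next call

    def choose(self) -> str:
        """Return the next healthy backend, or the landing page if all are down."""
        count = len(self._backends)
        for offset in range(count):
            index = (self._next + offset) % count
            candidate = self._backends[index]
            if self._is_healthy(candidate):
                # Advance past the chosen backend so traffic keeps rotating.
                self._next = (index + 1) % count
                return candidate
        return self._landing_page


# Example usage with an in-memory set standing in for real health checks.
PROVIDER_A = "https://provider-a.example"
PROVIDER_B = "https://provider-b.example"
LANDING = "https://status.example/landing"

down = set()
balancer = FailoverBalancer(
    [PROVIDER_A, PROVIDER_B],
    is_healthy=lambda backend: backend not in down,
    landing_page=LANDING,
)
```

While both providers are healthy, successive calls to `balancer.choose()` alternate between them. Adding a provider to `down` makes the balancer route everything to the other one, and if both are down it returns the landing page URL.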

And you can avoid

In this case, yes it is!

** UPDATED 03/02/2017 **

So it appears a single mistyped command from an operator was responsible.

“The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process,” the team wrote in its message.

“Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems.”

Mags says:

February 28, 2017 at 10:55 pm

Too many people (non-IT folk) seem to think that the cloud is this magical place that never has an issue. No matter how many outages Amazon, Azure, etc. have, people still seem to think that it’s made of magic.

Deploy in the cloud by all means, but still back up, replicate, and ensure that you don’t have a single point of failure.
