
Why AWS S3 Went Down: An Engineer's Typo Deleted the Wrong Servers and Caused a Four-Hour Outage

MarvinZhang / 1,130 reads

Abstract: On Thursday, AWS said that a mistyped command caused the hours-long Amazon Web Services outage. At 9:37 AM Pacific Standard Time, an authorized team member, following an established playbook, executed a command intended to remove a small number of servers from one of the subsystems used by the billing process.

AWS has explained how the S3 storage service in its large US-EAST-1 region came to be disrupted, and what it is doing to prevent it from happening again.

On Thursday, AWS said that a mistyped command caused the hours-long Amazon Web Services (AWS) outage. The incident knocked well-known websites offline on Tuesday and caused problems for several others.

The cloud infrastructure provider offered the following explanation:

The Amazon Simple Storage Service (S3) team was debugging an issue that was causing the S3 billing system to run more slowly than expected. At 9:37 AM Pacific Standard Time (PST), an authorized S3 team member, using an established playbook, executed a command intended to remove a small number of servers from one of the S3 subsystems used by the S3 billing process. Unfortunately, one character of the command was typed incorrectly, and a much larger set of servers was removed than intended.

That mistake inadvertently took out two subsystems that are critical to every S3 object in the US-EAST-1 region, a sprawling data-center region that also happens to be Amazon's oldest. Both subsystems required a full restart. Amazon noted that this process, along with running the necessary safety checks, "took longer than expected."

While it was restarting, S3 was unable to service requests. Other AWS services in the region that rely on S3 for storage were also affected, including the S3 console, new Amazon Elastic Compute Cloud (EC2) instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from an S3 snapshot), and AWS Lambda.

Amazon noted that the index subsystem was fully recovered by 1:18 PM, and the placement subsystem returned to normal at 1:54 PM. At that point, S3 was operating normally.

AWS said that, as a result of the incident, it is "making several changes," including measures to prevent incorrect input from triggering this kind of problem in the future.

The official blog explained: "While removal of capacity is a key operational practice, in this instance the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level."

Other notable steps AWS has already taken: it has begun work on partitioning parts of the index subsystem into smaller cells, and it has changed the administration console of the AWS Service Health Dashboard so that the dashboard can run across multiple AWS regions. Ironically, Tuesday's typo also knocked out the dashboard, forcing AWS to rely on Twitter to keep customers informed about the problem.

The full text of AWS's note on the Amazon S3 service disruption in the Northern Virginia (US-EAST-1) Region follows:

Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region

We’d like to give you some additional information about the service disruption that occurred in the Northern Virginia (US-EAST-1) Region on the morning of February 28th. The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.
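To make the dependency described above concrete, here is a minimal sketch, not AWS's actual architecture: every API verb needs the index subsystem for object metadata and location, while PUT additionally needs the placement subsystem to allocate storage for new objects. The class, names, and health flags are invented for illustration.

```python
# Minimal illustrative model only; it is not AWS code. It encodes the
# dependencies stated in the post: index serves GET/LIST/PUT/DELETE metadata,
# and placement (which itself needs index) is required for PUT allocation.
from dataclasses import dataclass


@dataclass
class Subsystem:
    name: str
    healthy: bool = True


index = Subsystem("index")          # metadata and location of every S3 object
placement = Subsystem("placement")  # allocates storage for newly written objects

REQUIREMENTS = {
    "GET": [index],
    "LIST": [index],
    "DELETE": [index],
    "PUT": [index, placement],
}


def can_serve(verb: str) -> bool:
    """A request can be served only if every subsystem it depends on is healthy."""
    return all(dep.healthy for dep in REQUIREMENTS[verb])


# During the restart both subsystems were down, so no requests could be served.
index.healthy = placement.healthy = False
print({verb: can_serve(verb) for verb in REQUIREMENTS})  # all False

# Recovery mirrored the dependency order: once index returned, GET/LIST/DELETE
# came back; PUT had to wait for placement as well.
index.healthy = True
print({verb: can_serve(verb) for verb in REQUIREMENTS})  # PUT still False
```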

S3 subsystems are designed to support the removal or failure of significant capacity with little or no customer impact. We build our systems with the assumption that things will occasionally fail, and we rely on the ability to remove and replace capacity as one of our core operational processes. While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years. S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected. The index subsystem was the first of the two affected subsystems that needed to be restarted. By 12:26PM PST, the index subsystem had activated enough capacity to begin servicing S3 GET, LIST, and DELETE requests. By 1:18PM PST, the index subsystem was fully recovered and GET, LIST, and DELETE APIs were functioning normally. The S3 PUT API also required the placement subsystem. The placement subsystem began recovery when the index subsystem was functional and finished recovery at 1:54PM PST. At this point, S3 was operating normally. Other AWS services that were impacted by this event began recovering. Some of these services had accumulated a backlog of work during the S3 disruption and required additional time to fully recover.

We are making several changes as a result of this operational event. While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level. This will prevent an incorrect input from triggering a similar event in the future. We are also auditing our other operational tools to ensure we have similar safety checks. We will also make changes to improve the recovery time of key S3 subsystems. We employ multiple techniques to allow our services to recover from any failure quickly. One of the most important involves breaking services into small partitions which we call cells. By factoring services into cells, engineering teams can assess and thoroughly test recovery processes of even the largest service or subsystem. As S3 has scaled, the team has done considerable work to refactor parts of the service into smaller cells to reduce blast radius and improve recovery. During this event, the recovery time of the index subsystem still took longer than we expected. The S3 team had planned further partitioning of the index subsystem later this year. We are reprioritizing that work to begin immediately.
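AWS describes the changed behavior of its removal tool (remove capacity more slowly, and refuse removals that would take a subsystem below its minimum required capacity) without publishing the tool itself. The sketch below only illustrates that idea; the fleet structure, capacity floors, batch size, and pause are all assumptions.

```python
# Illustrative sketch of the safeguard described above, not AWS's tool.
# All names and numbers here are hypothetical.
import time

MIN_REQUIRED = {"index": 150, "placement": 40}  # hypothetical capacity floors
BATCH_SIZE = 2                                  # remove only a few hosts at a time
BATCH_PAUSE_SECONDS = 30                        # slow removal leaves time to abort


def remove_capacity(fleet: dict[str, set[str]], subsystem: str, hosts: list[str]) -> None:
    """Remove hosts from a subsystem, slowly, and never below its minimum capacity."""
    active = fleet[subsystem]
    requested = [h for h in hosts if h in active]
    if len(active) - len(requested) < MIN_REQUIRED[subsystem]:
        raise RuntimeError(
            f"refusing removal: {subsystem} would drop below "
            f"{MIN_REQUIRED[subsystem]} hosts"
        )
    for i in range(0, len(requested), BATCH_SIZE):
        for host in requested[i:i + BATCH_SIZE]:
            active.remove(host)          # a real tool would drain and deregister here
        time.sleep(BATCH_PAUSE_SECONDS)


# Example with a hypothetical fleet: removing 3 of 160 index hosts is allowed
# (in slow batches); asking for enough hosts to fall below 150 is rejected up
# front instead of silently deleting a large part of the fleet.
fleet = {"index": {f"idx-{i:03d}" for i in range(160)},
         "placement": {f"plc-{i:03d}" for i in range(45)}}
# remove_capacity(fleet, "index", ["idx-000", "idx-001", "idx-002"])    # permitted
# remove_capacity(fleet, "index", [f"idx-{i:03d}" for i in range(30)])  # rejected
```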

From the beginning of this event until 11:37AM PST, we were unable to update the individual services’ status on the AWS Service Health Dashboard (SHD) because of a dependency the SHD administration console has on Amazon S3. Instead, we used the AWS Twitter feed (@AWSCloud) and SHD banner text to communicate status until we were able to update the individual services’ status on the SHD. We understand that the SHD provides important visibility to our customers during operational events and we have changed the SHD administration console to run across multiple AWS regions.
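AWS says only that the SHD administration console now runs across multiple AWS regions. One common way to get that property, sketched below with entirely hypothetical endpoints, is to publish the status document to independent stores in several regions and have readers fall back to the next region when one is unreachable; this is not necessarily how AWS implemented the change.

```python
# Hypothetical sketch of multi-region status reads; the endpoints are invented.
from urllib.error import URLError
from urllib.request import urlopen

STATUS_ENDPOINTS = [
    "https://status-us-east-1.example.com/health.json",  # primary copy
    "https://status-us-west-2.example.com/health.json",  # independent fallbacks
    "https://status-eu-west-1.example.com/health.json",
]


def fetch_status(timeout: float = 3.0) -> bytes:
    """Return the first reachable copy of the status document."""
    last_error: Exception | None = None
    for url in STATUS_ENDPOINTS:
        try:
            with urlopen(url, timeout=timeout) as response:
                return response.read()
        except (URLError, OSError) as exc:
            last_error = exc  # this region is unreachable; try the next one
    raise RuntimeError("status document unavailable in every region") from last_error
```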

Finally, we want to apologize for the impact this event caused for our customers. While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.


