Summary: AWS said on Thursday that a mistyped command caused the hours-long Amazon Web Services outage. At 9:37 AM PST that morning, an authorized team member using an established playbook executed a command intended to remove a small number of servers from one of the subsystems used by the billing process.
AWS explained how the S3 storage service in its sprawling US-EAST-1 geographic region came to be disrupted, and what it is doing to keep it from happening again.
AWS said on Thursday that a mistyped command caused the hours-long Amazon Web Services (AWS) outage that took prominent websites offline on Tuesday and caused problems for several more.
The cloud infrastructure provider offered this explanation:
The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37 AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.
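AWS has not published the actual command or its host-naming scheme, but the failure mode is familiar from any fleet tool that selects servers by pattern. Below is a purely hypothetical sketch (the inventory, host names, and selector syntax are all invented for illustration) of how dropping a single character from a selector can silently widen the match tenfold:

```python
import fnmatch

# Hypothetical inventory; AWS has not published its real tooling or
# host-naming scheme. 2,000 servers in a shared capacity pool.
fleet = [f"s3-cap-{i:04d}" for i in range(2000)]

def matching_hosts(selector: str) -> list[str]:
    """Return every host whose name matches the shell-style selector."""
    return fnmatch.filter(fleet, selector)

intended = matching_hosts("s3-cap-003*")  # s3-cap-0030..0039: 10 hosts
mistyped = matching_hosts("s3-cap-00*")   # one character dropped
print(len(intended), len(mistyped))       # 10 100

# Without a safeguard, the removal step acts on whichever set matched;
# nothing distinguishes "a small number" from a tenth of the fleet.
```

The safeguards AWS describes further down move exactly that sanity check into the tool itself.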
The error inadvertently took out servers supporting two subsystems critical to all S3 objects in the US-EAST-1 region, a massive data-center region that also happens to be Amazon's oldest. Both subsystems required a full restart, and Amazon noted that this process, along with running the necessary safety checks, "took longer than expected."
While they were restarting, S3 was unable to service requests. Other AWS services in the region that rely on S3 for storage were affected as well, including the S3 console, new Amazon Elastic Compute Cloud (EC2) instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from an S3 snapshot), and AWS Lambda.
Amazon noted that the index subsystem was fully recovered by 1:18 PM and the placement subsystem returned to normal at 1:54 PM, at which point S3 was operating normally.
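That recovery order follows from the dependency in AWS's account: the placement subsystem needs a functioning index subsystem, so a full restart has to bring the index back and verify it before placement can start. Here is a minimal sketch of that ordering; the health probe and timings are simulated stand-ins, while S3's real safety checks validated metadata integrity and are what took hours at its scale:

```python
import time

# Restart in dependency order: placement requires a healthy index subsystem.
# The probe below is simulated for illustration only.
RESTART_ORDER = ["index", "placement"]
started_at: dict[str, float] = {}

def restart(subsystem: str) -> None:
    started_at[subsystem] = time.monotonic()
    print(f"restarting {subsystem}")

def is_healthy(subsystem: str) -> bool:
    # Simulated probe: report healthy once the subsystem has been up 0.1 s.
    return time.monotonic() - started_at[subsystem] > 0.1

for subsystem in RESTART_ORDER:
    restart(subsystem)
    while not is_healthy(subsystem):  # block before starting dependents
        time.sleep(0.01)
    print(f"{subsystem} verified healthy; dependents may proceed")
```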
AWS noted that it is "making several changes" as a result of the event, including safeguards to keep an incorrect input from triggering this kind of problem in the future.
The official post explains: "While removal of capacity is a key operational practice, in this instance the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level."
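Both safeguards in that quote, slower removal and a capacity floor, are easy to picture in code. The sketch below is an assumption-laden illustration, not AWS's implementation: the batch size, interval, floor values, and the `remove_capacity` helper itself are all invented.

```python
import time

# Illustrative values only; AWS has not published its tooling.
MIN_CAPACITY = {"index": 200, "placement": 80}  # floor per subsystem
MAX_BATCH = 5                                   # only a few servers at a time
BATCH_INTERVAL_S = 60.0                         # pause between batches

def remove_capacity(subsystem: str, live: set[str], doomed: list[str]) -> None:
    """Drain servers slowly, refusing to breach the subsystem's capacity floor."""
    floor = MIN_CAPACITY[subsystem]
    for start in range(0, len(doomed), MAX_BATCH):
        batch = doomed[start : start + MAX_BATCH]
        # Safeguard: never let a removal take the subsystem below its
        # minimum required capacity level, no matter what was typed.
        if len(live) - len(batch) < floor:
            raise RuntimeError(
                f"refusing removal: {subsystem} would drop below {floor} servers"
            )
        for server in batch:
            live.discard(server)
        time.sleep(BATCH_INTERVAL_S)  # slow removal keeps the change observable
```

Under a guard like this, a mistyped input would fail loudly at the first batch instead of draining the fleets behind the index and placement subsystems.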
Other notable steps AWS has already taken: it has begun partitioning parts of the index subsystem into smaller cells, and it has changed the administration console of the AWS Service Health Dashboard (SHD) so the dashboard can run across multiple AWS regions. Ironically, the typo knocked the dashboard itself out on Tuesday, forcing AWS to fall back on Twitter to keep customers posted on the problem.
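Cell partitioning is the blast-radius technique the statement refers to: route each object key to one small, independently operated and recoverable slice of the service, so a failure or restart touches only that slice. A minimal sketch of the routing idea (the cell count and hash choice are illustrative, not S3's actual design):

```python
import hashlib

NUM_CELLS = 16  # illustrative; more cells means a smaller share per failure

def cell_for(key: str) -> int:
    """Stably map an object key to a cell."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_CELLS

# Restarting or losing one cell affects only the keys that hash to it,
# roughly 1/NUM_CELLS of the workload, instead of an entire region.
print(cell_for("photos/2017/02/28/outage-postmortem.png"))
```

The same reasoning applies to the dashboard fix: a status page must not share fate with the service it reports on, hence running the SHD console across multiple regions. AWS's full statement, reproduced below, covers both points in more detail.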
Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region
We’d like to give you some additional information about the service disruption that occurred in the Northern Virginia (US-EAST-1) Region on the morning of February 28th. The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.
S3 subsystems are designed to support the removal or failure of significant capacity with little or no customer impact. We build our systems with the assumption that things will occasionally fail, and we rely on the ability to remove and replace capacity as one of our core operational processes. While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years. S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected. The index subsystem was the first of the two affected subsystems that needed to be restarted. By 12:26PM PST, the index subsystem had activated enough capacity to begin servicing S3 GET, LIST, and DELETE requests. By 1:18PM PST, the index subsystem was fully recovered and GET, LIST, and DELETE APIs were functioning normally. The S3 PUT API also required the placement subsystem. The placement subsystem began recovery when the index subsystem was functional and finished recovery at 1:54PM PST. At this point, S3 was operating normally. Other AWS services that were impacted by this event began recovering. Some of these services had accumulated a backlog of work during the S3 disruption and required additional time to fully recover.
We are making several changes as a result of this operational event. While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level. This will prevent an incorrect input from triggering a similar event in the future. We are also auditing our other operational tools to ensure we have similar safety checks. We will also make changes to improve the recovery time of key S3 subsystems. We employ multiple techniques to allow our services to recover from any failure quickly. One of the most important involves breaking services into small partitions which we call cells. By factoring services into cells, engineering teams can assess and thoroughly test recovery processes of even the largest service or subsystem. As S3 has scaled, the team has done considerable work to refactor parts of the service into smaller cells to reduce blast radius and improve recovery. During this event, the recovery time of the index subsystem still took longer than we expected. The S3 team had planned further partitioning of the index subsystem later this year. We are reprioritizing that work to begin immediately.
From the beginning of this event until 11:37AM PST, we were unable to update the individual services’ status on the AWS Service Health Dashboard (SHD) because of a dependency the SHD administration console has on Amazon S3. Instead, we used the AWS Twitter feed (@AWSCloud) and SHD banner text to communicate status until we were able to update the individual services’ status on the SHD. We understand that the SHD provides important visibility to our customers during operational events and we have changed the SHD administration console to run across multiple AWS regions.
Finally, we want to apologize for the impact this event caused for our customers. While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.