November 27, 2018
Azure is considered one of the best cloud computing platforms available and is trusted by 85% of Fortune 500 companies [1]. In fact, it is the most widely used cloud platform in the world after AWS. So when an outage struck one of the platform's most popular features last week, it was a major inconvenience for many.
On November 19th, many users were locked out of their Azure accounts, as well as some other Microsoft accounts, due to a then-unknown glitch. [2]
When they tried to log in to their accounts, they were asked to complete a second round of authentication, known as multi-factor authentication or MFA, by entering a security code delivered via text message or push notification. However, after users entered their passwords, no code or notification ever arrived on their registered accounts or devices. Later, a notice was posted on the Office 365 service health page, stating that, “Affected users may be unable to sign in.”
Now, a week later, the Azure team at Microsoft has shared the underlying cause of the failure with the public. [3]
In fact, Microsoft identified three previously hidden factors that left users unable to sign in to their Azure, Office 365, Dynamics, and other Microsoft accounts.
The outage, which affected the platforms for 14 hours, stemmed from three distinct issues. The primary factor lay in the MFA front end's communication with its cache. The second was a race condition in the back end, which prevented servers from processing the appropriate responses and delayed recovery efforts. After these two issues were fixed, the outage was resolved later that day.
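Microsoft's write-up does not include code, but as a rough illustration of what a back-end race condition looks like, the minimal Python sketch below (all names are hypothetical and unrelated to Azure's actual implementation) shows the classic check-then-act pattern, in which two workers handling responses concurrently can interfere with each other, and how a lock removes the race.

```python
import threading

# Hypothetical illustration only: a shared "pending responses" table with an
# unsynchronized check-then-act, the classic shape of a back-end race condition.
pending = {}
lock = threading.Lock()

def handle_response_unsafe(request_id, payload):
    # Race: two workers can both observe the id as missing, and one response
    # can silently overwrite the other instead of being processed correctly.
    if request_id not in pending:
        pending[request_id] = payload

def handle_response_safe(request_id, payload):
    # Guarding the check-then-act with a lock removes the race.
    with lock:
        if request_id not in pending:
            pending[request_id] = payload

if __name__ == "__main__":
    threads = [
        threading.Thread(target=handle_response_safe, args=("req-1", f"reply-{i}"))
        for i in range(4)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(pending)  # only the first reply is recorded, deterministically
```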
It was only in yesterday’s report that the team announced the presence of a third factor [4]. The report said, “The third identified root cause, was previously undetected issue in the backend MFA server that was triggered by the second root cause. This issue causes accumulation of processes on the MFA backend leading to resource exhaustion on the backend at which point it was unable to process any further requests from the MFA frontend while otherwise appearing healthy in our monitoring.”
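Again purely as an illustration, and assuming nothing about the real MFA servers, the sketch below shows the general failure shape the report describes: work accumulating without bound exhausts the back end's resources even though a simple health probe still sees the process as alive. One common defense, shown here, is to cap concurrency and shed load once the cap is reached.

```python
import concurrent.futures
import threading
import time

# Hypothetical sketch of the failure shape described in the report: requests
# arrive faster than they finish, processes accumulate, and the back end runs
# out of capacity while a basic liveness probe still reports it as healthy.
MAX_CONCURRENT = 4                       # assumed cap, not a real Azure setting
slots = threading.BoundedSemaphore(MAX_CONCURRENT)

def process_request(req_id):
    # Shed load instead of accumulating work once every slot is occupied.
    if not slots.acquire(blocking=False):
        return f"req-{req_id}: rejected, back end saturated"
    try:
        time.sleep(0.1)                  # stand-in for real MFA processing
        return f"req-{req_id}: processed"
    finally:
        slots.release()

if __name__ == "__main__":
    with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
        for line in pool.map(process_request, range(16)):
            print(line)
```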
To mitigate the issue, the team took a number of steps, including adding capacity at its data centers and raising throttling limits. However, these efforts resolved only part of the problem, compelling engineers to keep searching for the root causes; once those were identified, the team was able to eliminate them entirely.
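For readers unfamiliar with the term, "throttling limits" here refers to rate limits on incoming requests. A common implementation is a token bucket, sketched below with entirely hypothetical numbers; raising the limit corresponds to a larger bucket capacity or a faster refill rate.

```python
import time

# Illustrative token-bucket throttle; this assumes nothing about Azure's
# actual implementation. "Raising the throttling limit" maps to a larger
# capacity and/or a higher refill rate here.
class TokenBucket:
    def __init__(self, capacity, refill_per_sec):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, never above capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

if __name__ == "__main__":
    bucket = TokenBucket(capacity=5, refill_per_sec=2)
    for i in range(10):
        print(i, "allowed" if bucket.allow() else "throttled")
```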
To prevent similar incidents in the future, the team is also working on several post-recovery measures. These include reviewing deployment procedures so that similar issues can be caught during testing and even development cycles, an effort the team expects to continue until December 2018. Other planned measures include reviewing monitoring services and containment processes, and updating how status is communicated through the monitoring tools and the Service Health Dashboard. For further updates on the issue and the service itself, customers are encouraged to stay posted and check the company's official maintenance notification service. [5]
Resources: