Over the years, we have learned to see the digital ‘cloud’ as the glue that binds all of our important files and digital accessories. This includes our mobile apps, email, and the access controls for the entire organization network. We assumed that the cloud never stops working, right?
A global Azure outage earlier this year proved to be a gruelling experience for those working at Microsoft as well as for the organizations and individuals using its services. How exactly did this happen? What are the implications and dangers for businesses worldwide if such outages become frequent on the cloud computing platform? In this article, we'll take a deeper look at these questions and at what the answers indicate.
Microsoft has said that a 'rotation of keys' used for login and authentication was responsible for a roughly 14-hour Azure outage that began on March 15, 2021. This global Azure outage affected Office 365, Dynamics 365, Xbox Live, and several other Microsoft services.
Soon after the problems began, Microsoft updated its service health status page, confirming that users might have trouble using the company's major online collaboration, communication, and productivity services. Microsoft Teams was offline for several hours as a result of this outage.
A routine rotation of the security keys for Azure Active Directory (Azure AD) triggered the nearly 14-hour outage. Such key rotations normally benefit users by keeping them secure.
On this occasion, however, Microsoft was in the middle of a complex cross-cloud migration. One key had been marked "retain" so that it would be kept until the migration was complete. The automated rotation mechanism ignored the "retain" flag and removed the key anyway, and users were unable to log in once Azure services picked up the updated set of keys.
Microsoft users who were affected by the mishap expressed their annoyance, with many asking why the company would upgrade its authentication mechanism during the workweek rather than on the weekend when the danger of business interruption to users would be minimal.
Microsoft has released a preliminary root cause analysis of the Azure Active Directory outage that occurred on March 15. For cryptographic signing operations, Azure AD uses keys to support OpenID and other identity standard protocols. As part of Azure's routine security hygiene, an automated system removes keys that are no longer in use on a time-based schedule.
In the preceding weeks, one key had been marked "retain" for considerably longer than usual to support a complex cross-cloud migration. This exposed a bug in which the automation incorrectly ignored the "retain" state and removed the key.
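The class of bug described here can be illustrated with a small sketch. This is not Microsoft's code: the `SigningKey` type, the `remove_after` and `retain` fields, and the `honor_retain` switch are all hypothetical names chosen for illustration. The sketch only shows how a cleanup job that skips the "retain" check ends up deleting a key that was meant to be kept.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class SigningKey:
    key_id: str
    remove_after: datetime  # hypothetical: when the key becomes eligible for cleanup
    retain: bool = False    # hypothetical: operator flag meaning "do not remove"


def keys_to_remove(keys, now, honor_retain=True):
    """Return the IDs of keys a scheduled cleanup job would delete."""
    doomed = []
    for key in keys:
        if now < key.remove_after:
            continue  # not yet eligible for cleanup
        if honor_retain and key.retain:
            continue  # explicitly protected by the "retain" flag
        doomed.append(key.key_id)
    return doomed


now = datetime(2021, 3, 15, 19, 0, tzinfo=timezone.utc)
keys = [
    SigningKey("current", datetime(2021, 6, 1, tzinfo=timezone.utc)),
    SigningKey("stale", datetime(2021, 1, 1, tzinfo=timezone.utc)),
    SigningKey("migration", datetime(2021, 1, 1, tzinfo=timezone.utc), retain=True),
]

correct = keys_to_remove(keys, now)                      # respects the flag
buggy = keys_to_remove(keys, now, honor_retain=False)    # ignores the flag
print(correct)  # ['stale']
print(buggy)    # ['stale', 'migration']
```

In the buggy run, the "migration" key is removed even though it is flagged, which mirrors the failure mode Microsoft described.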
Azure AD publishes metadata about its signing keys to a global location in accordance with Internet identity standard protocols. When the public metadata was changed at 19:00 UTC on March 15, 2021, applications using these protocols with Azure AD began to pick up the new metadata and stopped trusting tokens and assertions signed with the key that had been removed. From that point on, end users could no longer access those applications.
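To see why removing a key from the published metadata locks users out, consider the following toy model. It is a deliberate simplification: real OpenID Connect deployments use asymmetric signatures and a published key set, whereas this sketch substitutes HMAC with shared secrets from Python's standard library so that it stays self-contained. The `sign` and `validate` functions and the key names are hypothetical.

```python
import hashlib
import hmac


def sign(kid, secret, payload):
    """Issue a toy token: key ID, payload, and an HMAC signature (stand-in for RSA)."""
    sig = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return {"kid": kid, "payload": payload, "sig": sig}


def validate(token, published_keys):
    """Trust a token only if its key ID is still present in the published metadata."""
    secret = published_keys.get(token["kid"])
    if secret is None:
        return False  # key no longer published, so the token is rejected
    expected = hmac.new(secret, token["payload"], hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, token["sig"])


# Hypothetical published metadata: key ID -> verification key.
published = {"key-a": b"secret-a", "key-b": b"secret-b"}
token = sign("key-b", published["key-b"], b"user=alice")

before = validate(token, published)  # key-b is published
del published["key-b"]               # the cleanup job removes the key
after = validate(token, published)   # the same token is no longer trusted
print(before, after)  # True False
```

The token itself never changes. It stops being accepted only because the application can no longer find the matching key in the metadata it has refreshed.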
Bugs will always be present in any kind of software, and no cloud service can guarantee zero downtime. However, we can hope for faster recovery times and greater clarity about what went wrong at Microsoft's Azure earlier this year, and about what is being done to prevent future problems.
Microsoft, for one, is adopting certain strategies to prevent future interruptions. According to the Azure incident report on the outage, Azure AD is in a multi-phase effort to add extra protections to the backend Safe Deployment Process (SDP) system in order to prevent a class of problems that includes this one.
According to the status page, companies that rely on Windows VMs or Azure DevOps may have been affected as well.
Any cloud outage on such a large scale has business implications. Some of these are as follows.
1) The contemporary digital environment leaves no room for poor customer interactions. Modern technologies have led people to expect a consistently high quality of service.
2) Because cloud services are expected to be 'always on', any disruption to day-to-day service damages consumer trust.
3) With so many providers to choose from, and with the flexibility and even the benefits of switching between them, interruptions may drive consumers to rivals. Businesses can no longer afford to take a band-aid approach to a gaping hole.
While losing a network connection can cause an immediate loss of productivity and opportunity, cloud system downtime can have a longer-term negative impact on a business. Platforms as sophisticated as Microsoft Azure should include ample safeguards for catching bugs. Cloud computing has many more advantages than disadvantages, but outages can still occur.
As enterprises continue to embrace digital transformation, IT systems will become increasingly complex, which makes efficient monitoring all the more necessary.