After a massive DNS-related outage on April Fools’ Day, Microsoft has made a full recovery and released a detailed incident report explaining what went wrong.
Spikes in DNS Queries Overwhelm Azure’s System
According to Azure, its servers received an atypical surge of DNS queries from around the world, directed at a set of domains hosted on its platform. The company says its layers of caches would normally absorb a surge of this kind, but a specific sequence of events exposed a code defect that allowed the system to become overloaded. The overload was largely due to the volume of retries from DNS clients, which Azure’s mitigation systems still treated as legitimate DNS traffic.
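The incident report summary above does not say how clients spaced their retries. As an illustration of why retry behavior matters, here is a minimal Python sketch of one common client-side technique, capped exponential backoff with jitter, which spreads retries out instead of piling them onto a struggling server. The names `backoff_delays`, `query_with_backoff`, and the `resolve` callable are hypothetical and are not part of any Azure or Microsoft API.

```python
import random
import time


def backoff_delays(max_retries, base=0.5, cap=30.0, rng=random.random):
    """Yield one wait time (in seconds) per retry.

    The ceiling doubles on each retry up to `cap`, and the actual wait is a
    random fraction of that ceiling ("full jitter"), so that many clients
    failing at once do not all retry at the same instant.
    """
    for attempt in range(max_retries):
        ceiling = min(cap, base * 2 ** attempt)
        yield rng() * ceiling


def query_with_backoff(resolve, name, max_retries=4, sleep=time.sleep):
    """Call resolve(name); on failure, wait a jittered delay and try again.

    `resolve` is a hypothetical lookup function that raises OSError when the
    query fails. After the initial attempt plus `max_retries` retries, the
    last error is re-raised.
    """
    last_error = None
    delays = backoff_delays(max_retries)
    for _ in range(max_retries + 1):
        try:
            return resolve(name)
        except OSError as exc:
            last_error = exc
            delay = next(delays, None)
            if delay is None:  # no retries left
                break
            sleep(delay)
    raise last_error
```

Passing `sleep` and `rng` as parameters is only a convenience for testing; a real client would normally leave the defaults in place.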
Microsoft Scrambles to Fix Code Defect After Outage
Since Microsoft domains, including Xbox Live and Office 365, use Azure’s DNS, the entire product line was affected to one degree or another. As preventive measures, Microsoft is working to correct the code defect and improve the way it monitors unusual traffic.
It still isn’t clear what caused the surge in queries, though many sites have speculated that it could have been a DDoS attack against the targeted domains. Whatever the reason, it’s apparent that the Microsoft team has its work cut out for it, considering the company’s history of outages in 2020 alone.
Looking to the Future: DNS Problems and Solutions
Even if Microsoft does manage to improve its systems, one has to wonder why domains still rely on Azure DNS as their sole provider in the first place, or even why Microsoft does. With Secondary DNS, where a second, independent provider also answers queries for the same domain, incidents like this can easily be avoided. Rather than creating services and cloud environments that purposefully or inadvertently lead to vendor lock-in, companies should be putting the needs of their clients and their clients’ customers first.
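To make the idea concrete, here is a small Python sketch that simulates how a lookup can still succeed when a domain is served by two independent DNS providers and one of them is down. It is a simplified model under stated assumptions, not a real resolver: the provider labels, the `lookup` callables, and the address `203.0.113.10` (from a range reserved for documentation) are all hypothetical.

```python
def resolve_with_secondary(name, providers):
    """Try each DNS provider in order and return the first answer.

    `providers` is a list of (label, lookup) pairs. Each `lookup` is a
    hypothetical callable standing in for a provider's nameservers; it
    returns an address or raises OSError (which includes TimeoutError).
    """
    errors = []
    for label, lookup in providers:
        try:
            return label, lookup(name)
        except OSError as exc:
            errors.append(f"{label}: {exc}")
    raise LookupError(f"all providers failed for {name}: " + "; ".join(errors))


# Simulated providers: the primary is overloaded, the secondary still answers.
def primary_overloaded(name):
    raise TimeoutError("no response")


def secondary_healthy(name):
    return "203.0.113.10"


served_by, address = resolve_with_secondary(
    "example.com",
    [("primary", primary_overloaded), ("secondary", secondary_healthy)],
)
```

In practice the failover happens inside recursive resolvers, which choose among the nameservers listed for the domain; the loop above only mimics that behavior to show why a second provider removes the single point of failure.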
Infrastructure, peering, and transit capacity play a major role in the ability to keep DNS functioning smoothly and efficiently. But as Microsoft’s recent outage demonstrates, it takes even more than that. One huge downside to companies that provide multiple cloud services is their lack of dedicated DNS management. With all services running on the same network, performance suffers, and the likelihood of servers becoming overloaded increases.
As customers continue to voice their dissatisfaction when outages occur, we hope our industry will begin working to improve the internet experience for everyone, and not just for individual providers. No matter how big a provider is, it cannot carry the weight of the entire world’s domains on its infrastructure, and to pretend otherwise is just ridiculous.