Downtime

The Hub Outage

Apr 11 at 10:26am EDT
Affected services
Main Service

Resolved
Apr 12 at 12:53pm EDT

The following is the incident report for the outage of The Hub on April 11, 2022. I understand this issue has impacted our operations and students' access to services, and I apologize to everyone affected.

Issue Summary
From 10:26 AM to 6:30 PM EDT, requests to most pages on The Hub returned 503 error responses. At its peak, the issue affected 100% of traffic to The Hub. The root cause of the outage was that the database ran out of resources to process incoming queries.

Timeline (EDT)
10:26 AM: Outage begins.
10:30 AM: Issue noticed; started searching for the cause.
11:00 AM: Rolled back to the last known good version of The Hub.
12:10 PM: Problem persists; continued searching for the cause.
12:20 PM: Restarted the server.
12:30 PM: Problem persists; continued searching for the cause.
3:00 PM: Identified that the system was running out of resources.
3:10 PM: Identified the MariaDB service as the cause of the system overload.
3:20 PM: Checked all queries running on the system.
4:30 PM: Killed all MariaDB query processes and restricted access to The Hub to certain users.
5:30 PM: Removed the problematic view; materialized its contents into a new table and indexed it.
6:00 PM: Pushed the latest version to the server.
6:30 PM: 100% of traffic back online.

Root Cause
The Hub stores all students' data in a single table named classes, and a view summarizes all students since 2005. Multiple APIs and pages rely on this Students view to render and search students' names and records. Every query against the view is expensive, whether it comes from page rendering or API access. When enough traffic hits a page or API which uses the view, the system consumes a large amount of server resources. With only 2 GB of RAM on the server, the system ran out of resources and rejected all incoming requests.
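As an illustration, here is a minimal sketch of the structure described above; the column names and the exact view definition are assumptions, not the production schema. It shows why the view is so expensive: MariaDB re-runs the full scan and aggregation of classes for every request that touches it.

    -- Hypothetical view over the single classes table (names are assumptions)
    CREATE VIEW students AS
    SELECT student_id,
           MAX(student_name) AS student_name,
           COUNT(*)          AS classes_taken,
           MIN(year)         AS first_year
    FROM classes
    WHERE year >= 2005
    GROUP BY student_id;

    -- Every page or API call that searches by name re-runs the full aggregation
    SELECT * FROM students WHERE student_name LIKE 'Smi%';

On a 2 GB server, a handful of concurrent requests like the last one is enough to exhaust memory and temporary-table space.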

Resolution and recovery
At 10:30 AM EDT, the system developer noticed an access problem with The Hub, investigated, and quickly escalated the issue. At 11:00 AM, I attempted to roll back to a previous version, since the system had received a patch earlier in the morning. The rollback did not help, because the problem was not caused by the update. At 3:00 PM, we noticed that the MariaDB service was consuming almost all of the system's resources and that its processes would not finish. We killed all the SQL query processes, and The Hub came back online for about 5 minutes. At 3:20 PM, I investigated which query or function was causing the problem and restricted access to the service to developers only. Around 5:00 PM, I identified all of the questionable queries and removed the related code; the system's resources recovered and the service came back online. At 5:30 PM, I removed the problematic view, materialized its contents, and indexed the new table. After pushing the latest version back to the server at 6:00 PM, 100% of traffic was back online at 6:30 PM.
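The recovery steps above correspond to standard MariaDB operations. The sketch below illustrates them with assumed names (students_materialized, idx_students_name, and the process id 1234 are hypothetical); it is not a transcript of the exact commands run during the incident.

    -- List every running query to find the ones saturating the server
    SHOW FULL PROCESSLIST;

    -- Terminate a runaway query by its process id (1234 is only an example)
    KILL 1234;

    -- Materialize the expensive view into a real table, index it, then drop the view
    CREATE TABLE students_materialized AS
    SELECT * FROM students;

    CREATE INDEX idx_students_name ON students_materialized (student_name);

    DROP VIEW students;

Materializing trades freshness for cheap reads: the aggregation runs once when the table is built rather than on every request, and the index keeps name searches fast.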

Corrective and Preventative Measures
The following are actions I am taking to address the underlying causes of the issue and to help prevent recurrence and improve response times:
Materialize all students' data in the database to reduce database system usage; a sketch of the scheduled refresh follows this list. (Completed.)
Increase the resources allocated for SQL queries. (Completed.)
Add a secondary server to fail over to when the primary server fails.
Redesign the student information database schema so queries need fewer resources.
Update the system framework to the newest version to better handle overload.
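MariaDB has no built-in materialized views, so the materialized table needs a scheduled refresh to stay current. A minimal sketch using the event scheduler follows; the table, column, and event names carry over from the earlier examples and are assumptions, and it assumes student_id is the primary key of students_materialized.

    -- Make sure the event scheduler is running
    SET GLOBAL event_scheduler = ON;

    -- Rebuild the materialized summary once a day;
    -- REPLACE INTO upserts rows keyed on the table's primary key
    CREATE EVENT refresh_students_materialized
    ON SCHEDULE EVERY 1 DAY
    DO
      REPLACE INTO students_materialized
        SELECT student_id,
               MAX(student_name) AS student_name,
               COUNT(*)          AS classes_taken,
               MIN(year)         AS first_year
        FROM classes
        WHERE year >= 2005
        GROUP BY student_id;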

Updated
Apr 11 at 06:30pm EDT

All services are back to normal. We will provide an incident report tomorrow (04/12/2022).

Created
Apr 11 at 10:26am EDT

We are experiencing downtime for the primary server, and our team is working on a solution to restore the service.