Hosting Incident #18002
Issue with Hosting Serving Traffic
Incident began at 2018-02-01 01:45 and ended at 2018-02-01 07:37 (all times are US/Pacific).
|Feb 06, 2018||17:18||
Firebase Hosting was partially to completely unavailable for serving traffic between 1:00am and 7:30am Pacific time on February 1st. This outage also affected non-email/password sign-in methods of Firebase Auth for the majority of users.
DETAILED DESCRIPTION OF IMPACT:
Just before 1:00am on February 1, the CPU load of Firebase Hosting's origin servers increased to 100% across the board. By 1:19am performance had begun to degrade, and the on-call engineer received an alert. By 1:55am three Firebase Hosting engineers were investigating the disruption and looking for potential mitigations.
This affected between 5% and 20% of all Firebase Hosting traffic; requests for cached CDN content were unaffected by the outage. However, sites with low traffic, sites that had deployed recently, and sites with large numbers of URLs were likely to have suffered partial to total outages during the incident timeframe.
After several mitigation attempts, services were restored to 100% at approximately 7:30 a.m. The timeline of affected traffic is as follows:
01:00-01:30 - Load increases and origin servers begin to become unhealthy
01:30-04:00 - All or nearly all origin traffic is disrupted by CPU load
04:00-04:30 - 50% of origin traffic is restored, provisioning of new VMs begins
06:30-07:00 - 60% of origin traffic is restored as new VMs are brought into rotation
07:00-07:30 - 80% of origin traffic is restored
07:30 - 100% of traffic is restored, incident ends
ROOT CAUSE:
Origin traffic and CPU load had been increasing since January 21. This went unnoticed by the Firebase Hosting team due to insufficient internal monitoring of peak load outside of business hours. On February 1st, core services began to fail as a result of the increased load. Unhealthy servers were taken out of rotation by the load balancer, which increased the load on the remaining servers and caused cascading failures that could not be resolved without reducing load or adding capacity.
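The cascading failure described above can be illustrated with a small simulation. This is only a sketch under simplified assumptions, not a model of the actual Firebase Hosting infrastructure: load is assumed to be spread evenly across healthy servers, a server is assumed to be removed from rotation as soon as its share exceeds its capacity, and the function name, capacities, and load figures are all hypothetical.

```python
def simulate_cascade(total_load, capacities):
    """Return the number of healthy servers after each health-check round.

    Hypothetical model: load is split evenly across healthy servers, and any
    server whose share exceeds its capacity is taken out of rotation.
    """
    healthy = list(capacities)
    history = [len(healthy)]
    while healthy:
        share = total_load / len(healthy)
        # Servers that cannot carry their share are marked unhealthy.
        survivors = [c for c in healthy if c >= share]
        if len(survivors) == len(healthy):
            break  # Stable: every remaining server can carry its share.
        healthy = survivors
        history.append(len(healthy))
    return history


if __name__ == "__main__":
    fleet = [100, 105, 110, 120]  # made-up per-server capacities

    # Load just under the tipping point: the fleet stays stable.
    print(simulate_cascade(390, fleet))               # [4]

    # Slightly more load: one server drops out and the rest follow.
    print(simulate_cascade(410, fleet))               # [4, 3, 0]

    # Same load with two extra servers: no server is overloaded.
    print(simulate_cascade(410, fleet + [100, 100]))  # [6]
```

In this toy model, a small increase in load removes the weakest server, and redistributing its share overloads the rest. Only lowering `total_load` or adding entries to `capacities` restores a stable state, which mirrors why the incident could not be resolved without reduced load or increased capacity.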
REMEDIATION AND PREVENTION:
This was the first load-based failure of the Hosting origin servers, and the symptoms of the problem were not clearly diagnosed during the early phases of the investigation.
We are taking a number of steps to prevent incidents like this from occurring in the future:
|Feb 01, 2018||07:37||
Service has been restored to normal operation. We are carefully monitoring the system to ensure stability, but believe the incident has been successfully managed.
|Feb 01, 2018||07:15||
Mitigation is starting to take hold. System-wide error rates have dropped to around 5%, and we are working to get the service back to full operation within the next hour.
|Feb 01, 2018||06:40||
We are in the process of trying a new mitigation strategy. We will have an update on its success by 7:15 AM.
|Feb 01, 2018||04:50||
Mitigation work is ongoing. We do not have an estimate yet on when the service will be fully restored.
|Feb 01, 2018||02:30||
We are experiencing an issue with Firebase Hosting where requests to hosted domains are returning 503/504 errors. We apologize to everyone affected for any inconvenience.
|Feb 01, 2018||02:07||
We're investigating an issue with Firebase Hosting and will provide more information by 2:30 AM.