Hosting Incident #18002

Issue with Hosting Serving Traffic

Incident began at 2018-02-01 01:45 and ended at 2018-02-01 07:37 (all times are US/Pacific).

	Date	Time	Description
	Feb 06, 2018	17:18	SUMMARY: Firebase Hosting was partially to completely unavailable for serving traffic between 1:00 AM and 7:30 AM Pacific time on February 1. The outage also affected non-email/password sign-in methods of Firebase Auth for the majority of users.
DETAILED DESCRIPTION OF IMPACT: Just before 1:00 AM on February 1, the CPU load of Firebase Hosting's origin servers increased to 100% across the board. By 1:19 AM performance had begun to degrade and the on-call engineer received an alert. By 1:55 AM three Firebase Hosting engineers were investigating the disruption and looking for potential mitigations. The outage affected between 5% and 20% of all Firebase Hosting traffic, because requests for cached CDN content were unaffected. However, sites with little traffic, sites that had deployed recently, and sites with large numbers of URLs were likely to have suffered partial to total outages during the incident. After several mitigation attempts, service was fully restored at approximately 7:30 AM. The timeline of affected traffic is as follows:
01:00-01:30 - Load increases and origin servers begin to become unhealthy.
01:30-04:00 - All or nearly all origin traffic is disrupted by CPU load.
04:00-04:30 - 50% of origin traffic is restored; provisioning of new VMs begins.
06:30-07:00 - 60% of origin traffic is restored as new VMs are brought into rotation.
07:00-07:30 - 80% of origin traffic is restored.
07:30 - 100% of traffic is restored; incident ends.
ROOT CAUSE: Origin traffic and CPU load had been increasing since January 21. This went unnoticed by the Firebase Hosting team because internal monitoring of peak load outside business hours was insufficient. On February 1, core services began to fail under the increased load. The load balancer took unhealthy servers out of rotation, which increased the load on the remaining servers and caused cascading failures that could not be resolved without reducing load or adding capacity.
REMEDIATION AND PREVENTION: This was the first load-based failure of the Hosting origin servers, and the symptoms of the problem were not clearly diagnosed during the early phases of the investigation. We are taking a number of steps to prevent incidents like this from occurring in the future:
Provision additional capacity so that our capacity well exceeds peak load.
Continue migrating our deployment and serving infrastructure to make better use of internal infrastructure available at Google, with an eye toward better management and monitoring of peak load conditions.
Improve monitoring and growth alerting for peak origin traffic, paying specific attention to peaks that happen outside business hours.
Improve processes and tooling for provisioning new VMs, allowing faster response times in the future.
Investigate performance improvements to keep CPU pressure off our origin servers.
	Feb 01, 2018	07:37	Service has been restored to normal operation. We are carefully monitoring the system to ensure stability, but believe the incident has been successfully managed.
	Feb 01, 2018	07:15	Mitigation is starting to take hold. System-wide error rates have dropped to around 5%, and we are working to get the service back to full operation within the next hour.
	Feb 01, 2018	06:40	We are in the process of trying a new mitigation strategy. We will have an update on its success by 7:15 AM.
	Feb 01, 2018	04:50	Mitigation work is ongoing. We do not have an estimate yet on when the service will be fully restored.
	Feb 01, 2018	02:30	We are experiencing an issue with Firebase Hosting in which requests to hosted domains are returning 503/504 errors. We apologize to everyone affected for the inconvenience.
	Feb 01, 2018	02:07	We're investigating an issue with Firebase Hosting and will provide more information by 2:30 AM.

All times are US/Pacific
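
The cascading failure described under ROOT CAUSE can be illustrated with a toy model. The sketch below is not Firebase Hosting's load balancer or its real capacity figures: the function name, the per-server capacities, and the load values are all hypothetical, and it assumes load is spread evenly across healthy servers and that any server pushed past its capacity is removed from rotation.

```python
def simulate_cascade(capacities, total_load):
    """Toy model of a load-driven cascade (hypothetical, for illustration only).

    capacities: requests/s each server can sustain before becoming unhealthy.
    total_load: requests/s spread evenly across the servers still in rotation.
    Returns the number of healthy servers after each load-balancer round.
    """
    healthy = list(capacities)
    history = [len(healthy)]
    while healthy:
        per_server = total_load / len(healthy)
        # Servers pushed past their capacity are taken out of rotation.
        survivors = [c for c in healthy if c >= per_server]
        if len(survivors) == len(healthy):
            break  # every remaining server can carry its share: stable
        healthy = survivors
        history.append(len(healthy))
    return history


# Load just within capacity: the pool is stable.
print(simulate_cascade([100, 110, 120, 130], 400))       # [4]
# Slightly more load: one server drops out, then the rest follow.
print(simulate_cascade([100, 110, 120, 130], 420))       # [4, 3, 0]
# Same load with one extra server provisioned up front: stable again.
print(simulate_cascade([100, 110, 120, 130, 110], 420))  # [5]
```

In this model a small increase in load removes one server, and the redistributed traffic then overwhelms the remainder. That is consistent with the report's statement that recovery required reduced load or increased capacity.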