Realtime Database Incident #18002

Elevated latencies and intermittent timeouts

Incident began at 2018-01-16 05:40 and ended at 2018-01-16 10:10 (all times are US/Pacific).

	Date	Time	Description
	Jan 17, 2018	15:09	SUMMARY: Between 5:20 a.m. PST (UTC -8) and 10:10 a.m., projects experienced persistent elevated latencies and intermittent downtime due to a failed service. DETAILED DESCRIPTION OF IMPACT: At 5:20 a.m., some of our most heavily loaded customers experienced elevated latencies and intermittent timeouts due to an overloaded internal service. About an hour later, at 6:10 a.m., the internal service suffered a critical failure, resulting in downtime for all Firebase projects. A restart of the internal service provided temporary mitigation; however, the service continued to degrade while we investigated the root cause, resulting in minor latencies and a second outage for a number of projects between 9:50 a.m. and 10:10 a.m. ROOT CAUSE: Some time ago, we moved a critical path system from a single service (a possible single point of failure) to a multi-redundancy master/slave architecture. However, a code dependency was missed during the transition, and a critical path function still relied on the master instance instead of the redundant cluster. The master instance became overloaded due to memory pressure and eventually failed entirely. REMEDIATION AND PREVENTION: Patched the server process that caused the memory pressure. Deployed a fix for the faulty code that did not properly use the redundant cluster. Added alerting for detecting memory issues in this cluster. Since the memory pressure resulted from a gradual buildup, we added metrics to monitor trends over longer periods than month-to-month changes. Performed a cross-team investigation of possible networking failures that may have complicated this outage.
	Jan 16, 2018	10:33	We have identified a recurring fault in a critical service that has widespread impact, and have reconfigured around it. We are hoping services will now return to normal.
	Jan 16, 2018	07:07	The issue has been resolved as of 06:39 AM US/Pacific. Apologies for the inconvenience this has caused you. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.
	Jan 16, 2018	06:26	We're observing increased latency, starting around 5:40 AM US/Pacific, that affects all customers. This also affects Cloud Functions deployments to Realtime Database. We will provide more information as soon as possible.
