Realtime Database Incident #17094

Service Disruption for Realtime Database

Incident began at 2017-11-27 12:05 and ended at 2017-11-27 13:30 (all times are US/Pacific).

Date Time Description
Nov 29, 2017 14:06

SUMMARY:

All Realtime Databases were inoperable for at least some of the period between 12:00 and 13:30 on November 27th. The incident also affected database panels in the admin console, utilization statistics recorded during the incident, and Hosting deploys.

DETAILED DESCRIPTION OF IMPACT:

The first failures began at 12:00 and affected all database read and write operations. At 12:04 the Firebase on-call began investigating after receiving an alert from our monitoring tools. The on-call declared an incident and began restoring services. The final service was restored at 13:30, marking the end of the incident.

In addition to the loss of read and write access to Realtime Database instances, this incident rendered database-related features in the Firebase Console inoperable. Other parts of the console were unaffected. Because Firebase Hosting deploys depend on the Realtime Database, developers were also unable to deploy to Firebase Hosting.

Additionally, database stats for this period show abnormally high spikes in utilization (often more than 100% of available capacity). These figures are accurate: many servers were overloaded during the incident.

ROOT CAUSE:

A widespread failure in a Cloud data center took down the Realtime Database's disk persistence layer. Because reads and writes couldn't be committed to disk, the services eventually became overloaded and were unable to serve traffic. Since Realtime Database does not yet have multi-region redundancy, there was no failover mechanism to mitigate the outage.

REMEDIATION AND PREVENTION:

To reduce the time required to identify the root cause of issues like this in the future, we will add monitoring for disk failures and configure alerts based on that monitoring.
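As an illustration only, the kind of alerting described above could be sketched as a sliding-window check on disk write outcomes. This is not the monitoring Firebase actually deployed; the class name `DiskWriteMonitor`, its parameters, and the threshold values are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class DiskWriteMonitor:
    """Tracks recent disk write outcomes and flags a high failure rate.

    Hypothetical sketch: names and thresholds are assumptions, not taken
    from the incident report.
    """

    window_size: int = 100          # number of most recent writes considered
    failure_threshold: float = 0.2  # alert when this fraction of writes fail
    min_samples: int = 10           # avoid alerting on too little data
    _outcomes: deque = field(default_factory=deque)

    def record(self, succeeded: bool) -> None:
        """Record one write attempt, dropping the oldest beyond the window."""
        self._outcomes.append(succeeded)
        if len(self._outcomes) > self.window_size:
            self._outcomes.popleft()

    def failure_rate(self) -> float:
        """Fraction of failed writes in the current window (0.0 if empty)."""
        if not self._outcomes:
            return 0.0
        failures = sum(1 for ok in self._outcomes if not ok)
        return failures / len(self._outcomes)

    def should_alert(self) -> bool:
        """True once enough samples exist and the failure rate is too high."""
        return (
            len(self._outcomes) >= self.min_samples
            and self.failure_rate() >= self.failure_threshold
        )
```

A caller would invoke `record()` after each write attempt and page the on-call when `should_alert()` returns true. The `min_samples` guard is a design choice to keep a single failed write on an idle server from triggering a page.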

We launched a full investigation of the service logs to understand the outage and determine additional remediation.

We will continue to work on improved redundancy. For example, we released our next-gen, highly scalable database, Firestore (currently in beta), to address some of these needs.

Firebase Hosting is reducing dependencies on the Realtime Database for mission-critical operations to avoid deploy outages.

Nov 27, 2017 13:30

Services are fully restored. We will investigate root cause and post details when we have more info.

Nov 27, 2017 13:14

Service has been restored for most customers. A small number of servers are still recovering.

Nov 27, 2017 13:03

Services appear to be recovering, but it's early yet. The severity and cause are still under investigation.

Nov 27, 2017 12:20

We are investigating an issue with the Realtime Database. We will provide more information as it becomes available.
