Realtime Database Incident #17051

Issue with Realtime Database

Incident began at 2017-06-02 08:39 and ended at 2017-06-02 10:30 (all times are US/Pacific).

Date Time Description
Jun 20, 2017 13:23

ISSUE SUMMARY

Starting at 14:22 PDT on Tuesday 30 May 2017, Firebase Realtime Database returned incomplete data for incremental updates for up to 2 days, 18 hours and 51 minutes. Other types of queries were not affected. If your service or application was affected, we apologize. This is not the level of quality and reliability we strive to offer you. We have conducted an internal investigation and are taking immediate steps to improve the platform’s performance and availability.

DETAILED DESCRIPTION OF IMPACT

Firebase Realtime Database returned incomplete update data, and clients that normally render the data as arrays rendered it as hashmaps instead. In particular, mobile SDK applications with offline data persistence enabled broke because of the incorrect data format. We estimate that fewer than 5% of users were adversely affected by this bug.
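To illustrate how incomplete update data can turn an array into a hashmap on the client, here is a minimal Python sketch. It assumes a simplified heuristic in which a node is rendered as an array only when all of its keys are integers and more than half of the slots from 0 to the largest key are filled. The function name `render` and the sample data are hypothetical, and the actual SDK logic may differ.

```python
from typing import Dict, List, Optional, Union

def render(node: Dict[str, str]) -> Union[List[Optional[str]], Dict[str, str]]:
    # Assumption: only canonical ASCII digit strings count as integer keys.
    if not node or not all(k.isascii() and k.isdigit() for k in node):
        return dict(node)
    indices = {int(k): v for k, v in node.items()}
    size = max(indices) + 1
    # Assumed heuristic: more than half of the slots 0..max must be filled.
    if len(indices) * 2 <= size:
        return dict(node)
    return [indices.get(i) for i in range(size)]

# A complete node renders as an array.
complete = render({"0": "a", "1": "b", "2": "c"})
# A partial update carrying only the last element renders as a hashmap.
partial = render({"2": "c"})

print(complete)
print(partial)
```

Under these assumptions, a client that expects a list and receives a dictionary would fail as soon as it indexes or iterates the value as an array.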

The impact began at different times for different customers. The first customers were impacted at 14:22 PDT on Tuesday 30 May; by 13:20 PDT on Thursday 1 June, all customers relying on array rendering were impacted. Beginning at 08:19 PDT on Friday 2 June, the percentage of customers impacted dropped linearly until 09:13 PDT on Friday 2 June 2017, when no customers were impacted.

ROOT CAUSE

We split larger chunks of data when we save them to disk. We optimized the way we combine the splits when reading them back, but in doing so we sorted the data incorrectly. We normally sort by lexicographical string ordering, but we special-case integers and sort them before the strings, in numerical order. The optimization omitted this special case, which had the unintended effect of sending partial results for arrays. These partial results are only sent to our mobile SDKs when offline persistence is enabled, a configuration that our testing infrastructure did not cover.
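The ordering rule described above can be illustrated with a minimal Python sketch. The function names, the sample splits, and the definition of an integer key are assumptions made for illustration only. This is not the production implementation, and the merge shown here is only one plausible way that a dropped special case could reorder data.

```python
import heapq

def is_int_key(key: str) -> bool:
    # Assumption: an "integer" key is a canonical non-negative decimal
    # string such as "0" or "42" (no sign, no leading zeros).
    return key.isascii() and key.isdigit() and (key == "0" or key[0] != "0")

def intended_order(key: str):
    # Integers sort first, in numerical order; all other keys follow
    # in lexicographical string order.
    if is_int_key(key):
        return (0, int(key), "")
    return (1, 0, key)

# Two hypothetical on-disk splits, each already in the intended order.
split_a = ["0", "1", "2"]
split_b = ["10", "11", "name"]

# Merging with plain string comparison drops the integer special case.
wrong = list(heapq.merge(split_a, split_b))
# Merging with the special case keeps array indices in numerical order.
right = list(heapq.merge(split_a, split_b, key=intended_order))

print(wrong)  # ['0', '1', '10', '11', '2', 'name']
print(right)  # ['0', '1', '2', '10', '11', 'name']
```

With plain string comparison, "10" and "11" land before "2", so a consumer that expects indices in numerical order sees the array out of sequence.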

Since this change was tied to a build and was not behind a flag, the issue manifested immediately for all databases on a server as soon as the build was deployed to it.

REMEDIATION AND PREVENTION

One customer filed a report at 08:38 PDT Thursday 1 June, which was triaged by the Firebase support team for troubleshooting. At the time, there was no evidence that the issue affected additional customers. After additional customers noticed the issue starting at 04:45 PDT Friday 2 June, 2017, the issue was escalated to the on-call engineer at 08:14. The engineer immediately began a phased rollback to the previous build. The rollback was completed 59 minutes later at 09:13 PDT Friday 2 June, 2017, when all customers were receiving nominal data again.
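The durations quoted in this report can be checked from its own timestamps. The short Python sketch below does so using naive datetimes, on the assumption that every time is PDT with no daylight-saving change inside the window. The variable names are illustrative.

```python
from datetime import datetime, timedelta

# Timestamps taken from this report (all PDT).
first_impact = datetime(2017, 5, 30, 14, 22)
escalated = datetime(2017, 6, 2, 8, 14)
rollback_done = datetime(2017, 6, 2, 9, 13)

# Longest possible impact window for any one customer.
max_impact = rollback_done - first_impact
# Time from escalation to the end of the rollback.
rollback_time = rollback_done - escalated

print(max_impact)     # 2 days, 18:51:00
print(rollback_time)  # 0:59:00
```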

To prevent this in the future, we are increasing our unit test coverage to include testing of sorting behavior, and we are working on additional mobile integration tests to catch more mobile-specific edge cases. We will increase our mandatory rollout period so that issues like this are caught before they reach all customers. We will also improve our support procedures to more accurately characterize widespread customer issues and ensure prompt escalation to on-call engineers.

Google is committed to quickly and continually improving our technology and operations to prevent service disruptions. We appreciate your patience and apologize again for the impact to your organization. We thank you for your business.

Jun 02, 2017 10:30

The issue has been resolved. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Jun 02, 2017 08:39

We are still investigating an issue with Realtime Database. We will provide more information by 10:30AM US/Pacific.

All times are US/Pacific