Firebase Status Dashboard

This page provides status information on the services that are part of Firebase. Check back here to view the current status of the services listed below. If you are experiencing an issue not listed here, please contact Support. Learn more about what's posted on the dashboard in this FAQ. For additional information on these services, please visit https://firebase.google.com/.

NOTE

For incidents related to Cloud Functions, Cloud Firestore, and Cloud Storage, please see the Cloud Status Dashboard. For incidents related to Google Analytics, please see the Ads Status Dashboard.


Incident affecting Test Lab

Customer tests are failing on ARM virtual devices

Incident began at 2023-05-02 11:15 and ended at 2023-05-03 16:30 (all times are US/Pacific).

Date Time Description
11 May 2023 08:26 PDT

SUMMARY

Firebase Test Lab (FTL) recently had an incident that resulted in failed tests, elevated rates of inconclusive results, and extended wait times for tests targeting Arm AVDs. Arm AVDs¹ are a Beta release² from Firebase Test Lab that provide benefits such as faster execution time. The incident lasted approximately 30 hours and was exacerbated by mis-allocated network sockets uncovered during the investigation.

IMPACT

Between May 2, 7:48am PT (14:48 UTC) and May 3, 2:30pm PT (21:30 UTC), users of Arm devices observed instances of test failures (flakiness) and inconclusive test results. This issue only impacted customers using the Arm virtual devices.

ROOT CAUSE

After a test runs, its per-test resources are normally garbage collected. We uncovered a regression in the device cleanup logic that caused these resources to be retained instead, producing a memory leak.

CPU starvation and memory swapping induced by this memory leak led to competition for resources, which slowed emulators and caused random test failures. The memory leak was not observable until the following occurred:

  • We reached a level of usage that exposed the latent memory leak.
  • We skipped a release, leaving more time for the memory leak to accumulate.
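The failure mode described above can be illustrated with a minimal sketch. Everything in it is hypothetical: the `EmulatorHost` class, its methods, and the 1 MiB buffer are stand-ins chosen for illustration, not Test Lab's actual code. It only shows how a cleanup path that stops releasing per-test resources makes retained memory grow with the number of tests run since the last restart.

```python
class EmulatorHost:
    """Hypothetical host process that tracks per-test resources."""

    def __init__(self, cleanup_releases_resources: bool):
        self.cleanup_releases_resources = cleanup_releases_resources
        self._resources = {}  # test_id -> buffer allocated for that test

    def run_test(self, test_id: int) -> None:
        # Allocate a per-test resource (a stand-in for logs, snapshots, etc.).
        self._resources[test_id] = bytearray(1024 * 1024)
        self._cleanup(test_id)

    def _cleanup(self, test_id: int) -> None:
        if self.cleanup_releases_resources:
            # Intended behaviour: drop the last reference so it can be collected.
            self._resources.pop(test_id, None)
        # Regressed behaviour: the entry stays in the dict and is never freed.

    def retained_bytes(self) -> int:
        return sum(len(buf) for buf in self._resources.values())


def retained_after(tests: int, cleanup_releases_resources: bool) -> int:
    """Run `tests` tests on a fresh host and report the memory still held."""
    host = EmulatorHost(cleanup_releases_resources)
    for test_id in range(tests):
        host.run_test(test_id)
    return host.retained_bytes()
```

In this sketch the retained memory is proportional to the number of tests run by one process, which is consistent with the two conditions listed above: higher usage and a longer interval between releases both give the leak more room to accumulate.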

During this investigation, we uncovered a secondary issue: we had mis-allocated one of the sockets intended for use by the virtual machines (VMs). This caused Arm emulators to fail to boot, which made the primary issue harder to mitigate during the incident.

REMEDIATION

On May 2, 2023, we initially believed this issue was triggered by over-provisioning of emulators per host and the resulting overuse of CPU and memory. To address this, we reduced the number of emulators running concurrently per host. This mitigated the problem until the next day (May 3, 2023), when test failures returned as traffic on Arm devices continued to increase. At that point, we identified the root cause and mitigated it by performing a rolling restart across all machines. We will provide an update once the fix is made and the issue is resolved.

Until the fix is in production, we are monitoring heap memory and regularly restarting these processes.
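A stopgap of this kind can be sketched as a small policy function. The names `RestartPolicy` and `should_restart`, the threshold values, and the choice to restart on either a heap limit or a maximum uptime are assumptions made for illustration; the report does not describe the team's actual tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RestartPolicy:
    heap_limit_mb: float     # assumed heap threshold that triggers a restart
    max_uptime_hours: float  # assumed upper bound on time between restarts


def should_restart(heap_mb: float, uptime_hours: float,
                   policy: RestartPolicy) -> bool:
    """Return True when a process should be restarted under the policy."""
    over_heap = heap_mb >= policy.heap_limit_mb
    too_old = uptime_hours >= policy.max_uptime_hours
    return over_heap or too_old


# Example with made-up numbers: restart at 4 GiB of heap or after 24 hours.
policy = RestartPolicy(heap_limit_mb=4096, max_uptime_hours=24)
needs_restart = should_restart(heap_mb=5000, uptime_hours=3, policy=policy)
```

Combining a heap threshold with a time limit means a process is still recycled when its heap samples are missing or misleading, at the cost of some unnecessary restarts.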

PREVENTION

  • We have validated and committed the fixes for the memory leak and the mis-allocated sockets, but they are not yet in production.
  • We will invest in additional release qualification, load testing, and validation tooling to reduce the likelihood of incidents, and we will add alerting to help us find problems before they become outages. This will include alerts for memory-related problems.
  • To ensure sufficient redundancy moving forward, we plan to increase Arm device capacity before graduating the Arm AVD Beta.
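One way an alert for memory-related problems could work is to flag sustained heap growth rather than a single high reading. The sketch below fits a least-squares line to (hours, heap MB) samples and compares the slope with a limit. The function names, the sample format, and the limit are hypothetical and are not taken from the report.

```python
def growth_rate_mb_per_hour(samples):
    """Least-squares slope of (hours, heap_mb) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    return cov / var_t


def leak_alert(samples, max_growth_mb_per_hour):
    """Return True when heap usage grows faster than the allowed rate."""
    return growth_rate_mb_per_hour(samples) > max_growth_mb_per_hour
```

A trend-based check like this can fire while absolute memory use is still well below any hard limit, which is the situation a slowly accumulating leak produces.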

¹ https://firebase.google.com/docs/test-lab/android/avds

² https://firebase.google.com/support/releases#august_18_2022

3 May 2023 16:30 PDT

After we resolved the known issues, error rates for Arm virtual device tests returned to normal as of 3pm PT. We apologize for any inconvenience this has caused.

3 May 2023 14:04 PDT

We fixed one of the problems affecting Arm virtual devices, but are still seeing high rates of inconclusive results. The team is actively investigating. The available capacity of Arm virtual devices remains reduced. Customers seeing long delays or high failure rates are encouraged to use x86 virtual devices in the meantime. Next update: May 3rd, 5pm PT.

2 May 2023 18:45 PDT

We put mitigations in place that reduced the observed failure rates, but some customers may still experience elevated test failures. The available capacity of Arm virtual devices is currently reduced. Customers seeing long delays are encouraged to use x86 virtual devices in the meantime. Next update: May 3rd, 10am PT.

2 May 2023 11:44 PDT

Customer tests are failing on ARM virtual devices