Firebase Status Dashboard
NOTE
For incidents related to Cloud Functions, Cloud Firestore and Cloud Storage, please see Cloud Status Dashboard. And for incidents related to Google Analytics, please see Ads Status Dashboard.
Incident affecting Test Lab
Customer tests are failing on ARM virtual devices
Incident began at 2023-05-02 11:15 and ended at 2023-05-03 16:30 (all times are US/Pacific).
| Date | Time | Description |
|---|---|---|
| 11 May 2023 | 08:26 PDT | SUMMARY Firebase Test Lab (FTL) recently had an incident that resulted in failed tests, elevated rates of inconclusive results, and extended wait times for tests targeting Arm AVDs. Arm AVDs¹ are a Beta release² from Firebase Test Lab that provide benefits like faster execution time. The incident lasted approximately 30 hours. It was exacerbated by mis-allocated network sockets uncovered during the investigation. IMPACT Between May 2, 7:48am PT (14:48 UTC) and May 3, 2:30pm PT (21:30 UTC), users of Arm devices observed intermittent test failures (flakiness) and inconclusive test results. This issue only impacted customers using the Arm virtual devices. ROOT CAUSE When tests are run, the per-test resources are typically garbage collected. We uncovered a regression in the device cleanup logic where these resources were being held instead of released. CPU starvation and memory swapping induced by this memory leak led to competition for resources, which slowed emulators and caused random test failures. The memory leak was not observable until the following occurred:
During this investigation, we uncovered a secondary issue: we had mis-allocated one of the sockets intended for use by the virtual machines (VMs). This caused Arm emulators to fail to boot, making it challenging to mitigate the issue during the incident. REMEDIATION On May 2, 2023, we initially believed this issue was triggered by over-provisioning of emulators per host and overuse of CPU and memory. To prevent this, we reduced the number of emulators running concurrently per host. This mitigated the problem until the next day (May 3, 2023), when we continued to observe test failures as traffic on Arm devices continued to increase. At this point, we identified the root cause and mitigated it by performing a rolling restart across all machines. We will provide an update once the fix is made and the issue is resolved. Until the fix is in production, we are monitoring heap memory and regularly restarting these processes. PREVENTION
¹ https://firebase.google.com/docs/test-lab/android/avds ² https://firebase.google.com/support/releases#august_18_2022 |
| 3 May 2023 | 16:30 PDT | After we resolved the known issues, the error rates for Arm virtual tests returned to normal and have stayed there since 3pm PT. We apologize for any inconvenience this has caused. |
| 3 May 2023 | 14:04 PDT | We fixed one of the problems affecting Arm virtual devices, but are still seeing high rates of inconclusive results. The team is actively investigating. The available capacity of Arm virtual devices remains reduced. Customers seeing long delays or high failure rates are encouraged to use x86 virtual devices in the meantime. Next update: May 3rd, 5pm PT. |
| 2 May 2023 | 18:45 PDT | We put mitigations in place that reduced the observed failure rates, but some customers may still be experiencing elevated test failures. The available capacity of Arm virtual devices is currently reduced. Customers seeing long delays are encouraged to use x86 virtual devices in the meantime. Next update: May 3rd, 10am PT. |
| 2 May 2023 | 11:44 PDT | Customer tests are failing on ARM virtual devices |
- All times are US/Pacific
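
The interim mitigation described under REMEDIATION (monitoring heap memory and regularly restarting the affected processes via a rolling restart) could look roughly like the sketch below. This is an illustration only and does not reflect Firebase Test Lab's actual tooling: the `WorkerSample` type, the `workers_to_restart` and `rolling_batches` helpers, and all thresholds are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WorkerSample:
    """One monitoring sample for a hypothetical emulator host process."""
    worker_id: str
    heap_bytes: int          # current heap usage reported by the process
    uptime_seconds: float    # time since the process was last restarted


def workers_to_restart(samples: List[WorkerSample],
                       heap_limit_bytes: int,
                       max_uptime_seconds: float) -> List[str]:
    """Return IDs of workers whose heap exceeds the limit or that are
    due for a periodic restart (assumed policy, not the real one)."""
    return [
        s.worker_id
        for s in samples
        if s.heap_bytes > heap_limit_bytes
        or s.uptime_seconds >= max_uptime_seconds
    ]


def rolling_batches(worker_ids: List[str], batch_size: int) -> List[List[str]]:
    """Split restarts into batches so that only a few workers are
    offline at any one time during the rolling restart."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        worker_ids[i:i + batch_size]
        for i in range(0, len(worker_ids), batch_size)
    ]


if __name__ == "__main__":
    samples = [
        WorkerSample("host-a", heap_bytes=100, uptime_seconds=10),
        WorkerSample("host-b", heap_bytes=900, uptime_seconds=10),
        WorkerSample("host-c", heap_bytes=100, uptime_seconds=5000),
    ]
    due = workers_to_restart(samples, heap_limit_bytes=500,
                             max_uptime_seconds=3600)
    for batch in rolling_batches(due, batch_size=1):
        print("restart batch:", batch)
```

Restarting in small batches keeps most capacity online while leaked memory is reclaimed, which matters here because the available capacity of Arm virtual devices was already reduced during the incident.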