Cloud Messaging Incident #18015

FCM high latency issue

Incident began at 2018-09-06 12:00 and ended at 2018-09-07 21:30 (all times are US/Pacific).

Date Time Description
Oct 17, 2018 10:27

SUMMARY

FCM message senders (using both XMPP and HTTP protocols) experienced abnormally high error rates and latency when communicating with FCM servers for sends and upstream messages. Other interactions with FCM, such as device and topic registration, were unaffected.

DETAILED DESCRIPTION OF IMPACT

Low impact (less than 0.1% of traffic): 2018-08-29 through 2018-09-19

High impact (at least 0.1% of traffic, with peaks to about 8% of traffic):
2018-08-30 15:30 through 2018-08-31 11:00
2018-09-06 17:30 through 2018-09-07 21:30
2018-09-11 17:15 through 2018-09-14 21:00
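To put the three high-impact periods in perspective, their durations can be worked out directly from the timestamps listed above. The following is a minimal sketch, not part of the original report. The names `HIGH_IMPACT_WINDOWS` and `window_hours` are illustrative. The timestamps are treated as naive US/Pacific local times, which is safe here because no daylight-saving transition falls within these ranges.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# High-impact windows copied from the report (US/Pacific, naive local times).
HIGH_IMPACT_WINDOWS = [
    ("2018-08-30 15:30", "2018-08-31 11:00"),
    ("2018-09-06 17:30", "2018-09-07 21:30"),
    ("2018-09-11 17:15", "2018-09-14 21:00"),
]


def window_hours(start: str, end: str) -> float:
    """Return the length of one window in hours."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 3600


durations = [window_hours(start, end) for start, end in HIGH_IMPACT_WINDOWS]
total_hours = sum(durations)

for (start, end), hours in zip(HIGH_IMPACT_WINDOWS, durations):
    print(f"{start} through {end}: {hours:.2f} h")
print(f"Total high-impact time: {total_hours:.2f} h")
```

Under these assumptions the three windows last 19.5, 28 and 75.75 hours, for a total of 123.25 hours of high impact.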

ROOT CAUSE

Recent updates to the FCM software stack caused unexpectedly high server resource usage. That, combined with a bug that caused server load to randomly spike, created consistently poor performance for a small but significant percentage of FCM traffic. Due to the random nature of the spikes, it took longer than usual for FCM engineers to diagnose the issue.

REMEDIATION AND PREVENTION

During the first two of the three high impact periods, FCM engineers were able to relieve the effects to a substantial degree by providing temporary extra capacity to the FCM services. Since then, FCM servers have been granted permanent additional compute capacity.

The third period of high impact was caused by a high proportion of requests of a specific type that were likely to trigger the bug that caused server resource usage to spike. FCM engineers isolated these problematic requests into dedicated server pools (completed about 2018-09-14 14:00), and then identified and resolved the bug (completed about 2018-09-14 20:00).
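The isolation step described above, which sends requests likely to trigger the bug to dedicated server pools, can be illustrated with a small routing sketch. This is a hypothetical illustration only. The report does not name the request type, the pools, or the routing mechanism, so `RISKY_REQUEST_TYPES`, the pool names, and `choose_pool` are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical pool names; the report does not describe the real topology.
DEDICATED_POOL = "isolated-pool"
DEFAULT_POOL = "general-pool"

# Hypothetical placeholder for the request type that triggered the bug.
RISKY_REQUEST_TYPES = frozenset({"risky_type"})


@dataclass(frozen=True)
class Request:
    request_id: str
    request_type: str


def choose_pool(request: Request, risky_types=RISKY_REQUEST_TYPES) -> str:
    """Send bug-triggering request types to a dedicated pool, all others to the default pool."""
    if request.request_type in risky_types:
        return DEDICATED_POOL
    return DEFAULT_POOL


if __name__ == "__main__":
    for req in (Request("1", "risky_type"), Request("2", "ordinary_type")):
        print(req.request_id, "->", choose_pool(req))
```

The point of this design is containment: load spikes caused by the problematic requests stay within the dedicated pool, so the rest of the traffic is shielded while the underlying bug is diagnosed.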

FCM engineers are pursuing preventative measures, such as improved load management and balancing, to reduce the likelihood of a recurrence.

Sep 07, 2018 21:30

The latency issue with FCM should have been resolved for all affected users as of 9:30 PM US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence.

Sep 07, 2018 19:00

The latency issue with FCM should be resolved for the majority of users, and we expect a full resolution in the near future. We will provide another status update by 10:00 AM US/Pacific with current details.

Sep 07, 2018 14:00

We are still investigating the latency issue with FCM. We will provide an update as soon as possible.

Sep 07, 2018 10:15

FCM latency has improved but has not fully recovered yet. We’re still working on identifying the root cause of the issue. We’ll provide another update by 2:00 PM US/Pacific.

Sep 07, 2018 07:37

The rollback to an earlier state has completed. FCM is currently receiving a high volume of traffic, which is impacting latency; we are bringing additional capacity online to handle the increased traffic. Next update at 10:00 AM US/Pacific.

Sep 07, 2018 05:00

Currently rolling back to an earlier state. Next status update at 07:00 US/Pacific.

Sep 07, 2018 02:58

Latency remains elevated. We are continuing to roll out a change to mitigate the issue. Next status update at 05:00 US/Pacific.

Sep 07, 2018 01:08

We have identified an issue that is the cause of much of the increased latency. We are in the process of rolling out a fix. Next status update will be posted at 03:00 US/Pacific.

Sep 06, 2018 23:03

We are still investigating the issue with FCM message acceptance. Current data indicates that up to 0.15% of requests are affected by this issue. We will provide another status update by 01:00 US/Pacific with current details.

Sep 06, 2018 21:37

We are experiencing a high latency issue with acceptance of FCM messages beginning at 2018-09-06 12:00 US/Pacific.

For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 23:00 US/Pacific.

All times are US/Pacific