Cloud Messaging Incident #18016

FCM latency issue

Incident began at 2018-09-11 14:53 and ended at 2018-09-14 20:30 (all times are US/Pacific).

	Date	Time	Description
	Oct 17, 2018	10:27	SUMMARY
			FCM message senders (using both XMPP and HTTP protocols) experienced abnormally high error rates and latency when communicating with FCM servers for sends and upstream messages. Other interactions with FCM, such as device and topic registration, were unaffected.
			DETAILED DESCRIPTION OF IMPACT
			Low impact (less than 0.1% of traffic): 2018-08-29 through 2018-09-19
			High impact (at least 0.1% of traffic, with peaks to about 8% of traffic):
			2018-08-30 15:30 through 2018-08-31 11:00
			2018-09-06 17:30 through 2018-09-07 21:30
			2018-09-11 17:15 through 2018-09-14 21:00
			ROOT CAUSE
			Recent updates to the FCM software stack caused unexpectedly high server resource usage. That, combined with a bug that caused server load to spike at random, created consistently poor performance for a small but significant percentage of FCM traffic. Because the spikes were random, it took longer than usual for FCM engineers to diagnose the issue.
			REMEDIATION AND PREVENTION
			During the first two of the three high impact periods, FCM engineers were able to relieve the effects to a substantial degree by providing temporary extra capacity to the FCM services. Since then, FCM servers have been granted permanent additional compute capacity. The third period of high impact was caused by a high proportion of requests of a specific type that were likely to trigger the bug that caused server resource usage to spike. FCM engineers isolated these problematic requests into dedicated server pools (completed about 2018-09-14 14:00), and then identified and resolved the bug (completed about 2018-09-14 20:00). FCM engineers are pursuing preventative measures, such as improved load management and balancing, to make this less likely to recur.
	Sep 14, 2018	20:32	Our solution has brought latency and error rates for all users back down to pre-incident levels as of approximately 19:00 US/Pacific. Downstream messaging should be fully fixed. Upstream messages may still experience moderate delays. We will continue to analyze the root cause of this issue and put safeguards in place to prevent and mitigate future recurrences of this event.
	Sep 14, 2018	13:31	We are encouraged to see that our solution is improving overall latency during our initial testing, and we plan to continue rolling it out. You may have noticed some reduction in HTTP send latency starting at about 08:30 US/Pacific today. We continue to work hard to resolve this issue.
	Sep 14, 2018	00:42	We are rolling out changes to mitigate the latency issue. We'll provide the next status update when the rollout has completed.
	Sep 13, 2018	14:50	We've identified the cause and are working to get this resolved as soon as possible.
	Sep 13, 2018	00:13	We are still working to address the underlying cause of the latency issue with FCM messages. We will provide another status update as soon as possible.
	Sep 12, 2018	14:01	The latency issue with FCM should be mostly resolved, but some users may still experience intermittent elevated latency and error rates. We are working to address the underlying cause. We will provide another status update as soon as possible.
	Sep 12, 2018	03:25	We are still investigating the latency issue with FCM messages. We will provide another status update as soon as we have more information.
	Sep 12, 2018	01:12	We are experiencing a latency issue with FCM delivery of messages beginning at 2018-09-11 14:53 US/Pacific. For everyone who is affected, we apologize for any inconvenience you may be experiencing. We will provide an update by 03:00 US/Pacific with current details.
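
The three high impact windows listed in the Oct 17 summary entry can be turned into durations with a little date arithmetic. The sketch below is illustrative only and is not part of the official report: the names `HIGH_IMPACT_WINDOWS` and `window_hours` are made up for this example, and the timestamps are treated as plain US/Pacific wall-clock times (no daylight-saving change falls inside any of the windows).

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# High impact windows copied from the summary entry (US/Pacific wall-clock times).
HIGH_IMPACT_WINDOWS = [
    ("2018-08-30 15:30", "2018-08-31 11:00"),
    ("2018-09-06 17:30", "2018-09-07 21:30"),
    ("2018-09-11 17:15", "2018-09-14 21:00"),
]


def window_hours(start, end):
    """Return the length of one window in hours."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 3600


durations = [window_hours(start, end) for start, end in HIGH_IMPACT_WINDOWS]
total_hours = sum(durations)

for (start, end), hours in zip(HIGH_IMPACT_WINDOWS, durations):
    print(f"{start} -> {end}: {hours:.2f} h")
print(f"total high impact: {total_hours:.2f} h")
```

Under these assumptions the windows come to 19.5, 28 and 75.75 hours, about 123 hours of high impact in total.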
