Resolved: High Google Cloud Networking incident: Global: Experiencing Issue with Cloud networking status.cloud.google.com/incidents/6PM5…

## INCIDENT REPORT

## Introduction

We apologize for any impact the service disruption on Tuesday, 16 November 2021, may have had on your organization. Thank you for your patience and understanding as we worked to resolve the issue. We want to share some information about what happened and the steps we are taking to ensure this issue doesn’t occur again.

We also want to assure you that this service disruption does not have any bearing on our preparedness or platform reliability going into Black Friday/Cyber Monday (BFCM).
## Incident Summary

On Tuesday, 16 November 2021 at 09:35 PT, Google Cloud Networking experienced issues with the Google External Proxy Load Balancing (GCLB) service. Affected customers received Google 404 errors in response to HTTP/S requests.

Google engineers were alerted to the issue via automated alerting at 09:50 PT, which aligned with incoming customer support requests, and we immediately started to mitigate the issue by rolling back to the last known good configuration. Between 09:35 and 10:08 PT, customers affected by the outage may have encountered 404 errors when accessing any web page (URL) served by Google External Proxy Load Balancing.

A rollback to the last known good configuration completed at 10:08 PT, which resolved the 404 errors. To avoid the risk of a recurrence, our engineers suspended customer-initiated configuration changes in GCLB. As a result, GCLB service customers were unable to make changes to their load balancing configuration between 10:04 and 11:28 PT.

During the change suspension period, we validated the fix to safeguard against recurrence and deployed additional monitoring to ensure safe resumption of service. By 11:28 PT, customer configuration pushes resumed, and normal service was restored. The total duration of impact was 1 hour and 53 minutes.

## Root Cause

This incident was caused by a bug in the configuration pipeline that propagates customer configuration rules to GCLB. The bug was introduced six months ago and allowed a race condition (where behavior depends on the relative timing of data accesses) that would, in very rare cases, push a corrupted configuration file to GCLB. The GCLB update pipeline contains extensive validation checks to prevent corrupt configurations, but this particular race condition could corrupt the file near the end of the pipeline, after those checks had already run.
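To make the failure mode concrete, here is a minimal, self-contained sketch in Go. The names and structure are entirely hypothetical (this is not Google's pipeline code); it only illustrates how an update that races with the late stages of a pipeline can let a payload that was never validated be the one that gets pushed:

```go
// corrupt_config_race.go: a deliberately racy toy pipeline. An update lands
// between the validation stage and the serialization stage, so the payload
// that is pushed is not the payload that was validated. Run with
// `go run -race corrupt_config_race.go` to see the race reported.
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Config stands in for a customer's load-balancing configuration.
type Config struct {
	Backend string `json:"backend"`
	Port    int    `json:"port"`
}

// validate is the pipeline's correctness check: it runs early and passes.
func validate(c *Config) bool {
	return c.Backend != "" && c.Port > 0 && c.Port < 65536
}

// push hands the serialized bytes to the (imaginary) frontends. Nothing
// downstream re-checks them, mirroring the "end of the pipeline" gap.
func push(payload []byte) {
	fmt.Printf("pushed: %s\n", payload)
}

func main() {
	cfg := &Config{Backend: "svc-a", Port: 443}

	// A concurrent update rewrites the entry while the pipeline is mid-run.
	// The sleeps only widen the race window so the bug reproduces reliably.
	go func() {
		time.Sleep(1 * time.Millisecond)
		cfg.Backend = "" // transient half-rewritten state
		time.Sleep(5 * time.Millisecond)
		cfg.Backend = "svc-b"
	}()

	if !validate(cfg) { // runs first: cfg still looks good, so the check passes
		return
	}
	time.Sleep(3 * time.Millisecond) // later pipeline stages take time...
	payload, _ := json.Marshal(cfg)  // ...and capture the transient, corrupt state
	push(payload)                    // a config with an empty backend goes out
}
```

The two fix directions described below map onto this sketch: one removes the unsynchronized update itself, the other makes the receiving side refuse a payload like the one pushed here.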
A Google engineer discovered this bug on 12 November, which caused us to declare an internal high-priority incident because of the latent risk to production systems. After analyzing the bug, we froze a part of our configuration system to further reduce the likelihood of the race condition triggering. Since the race condition had already existed in the fleet for several months, the team believed this extra step kept the residual risk low.

The team therefore believed the lowest-risk path, especially given the proximity to BFCM, was to roll out fixes in a controlled manner rather than as a same-day emergency patch. We developed two mitigations: patch A closed the race condition itself, and patch B added additional input validation to the binary receiving the configuration so that it would reject a corrupted configuration even if the race condition occurred.
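As an illustration of the patch B idea only (hypothetical names and checks, not Google's implementation), the receiving binary can re-validate each payload and keep serving its last accepted configuration whenever a payload is rejected:

```go
// receiver_validation.go: a sketch of receiver-side input validation.
// The receiver re-checks every incoming payload and falls back to its last
// accepted configuration if the payload fails the checks.
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// RouteConfig is a stand-in for the propagated load-balancing configuration.
type RouteConfig struct {
	Host    string `json:"host"`
	Backend string `json:"backend"`
	Port    int    `json:"port"`
}

// Receiver holds the configuration currently in use.
type Receiver struct {
	current RouteConfig
}

// checkPayload rejects payloads that fail the structural checks the receiver
// knows about. Like any enumerated check, it only catches the error forms it
// was written to anticipate.
func checkPayload(raw []byte) (RouteConfig, error) {
	var c RouteConfig
	if !json.Valid(raw) {
		return c, errors.New("payload is not valid JSON")
	}
	if err := json.Unmarshal(raw, &c); err != nil {
		return c, err
	}
	if c.Host == "" || c.Backend == "" {
		return c, errors.New("missing host or backend")
	}
	if c.Port < 1 || c.Port > 65535 {
		return c, errors.New("port out of range")
	}
	return c, nil
}

// Apply swaps in the new configuration only if it passes validation;
// otherwise the receiver keeps its last accepted configuration.
func (r *Receiver) Apply(raw []byte) error {
	c, err := checkPayload(raw)
	if err != nil {
		return fmt.Errorf("rejecting config, keeping current: %w", err)
	}
	r.current = c
	return nil
}

func main() {
	r := &Receiver{current: RouteConfig{Host: "shop.example.com", Backend: "svc-a", Port: 443}}

	good := []byte(`{"host":"shop.example.com","backend":"svc-b","port":443}`)
	corrupt := []byte(`{"host":"shop.example.com","backend":"","port":443}`)

	fmt.Println(r.Apply(good))    // <nil>: new config accepted
	fmt.Println(r.Apply(corrupt)) // rejected: last accepted config stays in use
	fmt.Printf("serving: %+v\n", r.current)
}
```

Note that checks like checkPayload only reject the error forms they explicitly test for.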
Both patches were ready and verified to fix the problem by 13 November. Gradual rollouts of both patches started on Monday, 15 November, and patch B completed its rollout by that evening. On Tuesday, 16 November, with the patch A rollout within 30 minutes of completing, the race condition manifested in an unpatched cluster and the outage started.

Additionally, although patch B protected against the kinds of input errors observed during testing, the actual race condition produced a different form of error in the configuration, which the fully rolled-out patch B did not prevent from being accepted. Once the root cause was identified, our engineers mitigated the issue by restoring a known-good configuration, then completed and verified the fix, which eliminates the risk of recurrence.
## Service(s) Affected:

- Google Cloud Networking: Customer HTTP/S endpoints served 404 error pages. During partial recovery, traffic was served, but customers were unable to make changes to their load balancer configurations.
- GCLB can be used to load balance traffic to a number of other Google Cloud services, which lost traffic because of the outage.
- Customers who use serverless network endpoint groups on GCLB as a frontend to Google Cloud Run, Google App Engine, Google App Engine Flex, or Google Cloud Functions received 404 errors when attempting to access their service.
- Customers using Apigee, Firebase, or Google App Engine Flex received 404 errors when attempting to access their service.
## Zone(s) Affected: Global

## How Customers Experienced the Issue:

Between 09:35 and 10:08 PT, most endpoints served by global GCLB load balancers returned a 404 error. For an additional 1 hour and 20 minutes, customers were unable to make changes to their load balancing configuration.

## Workaround(s): None

Service was restored on 16 November 2021 at 11:28 PT, and the Google Cloud Status Dashboard was updated by 12:08 PT to reflect this.
## Remediation and Prevention

We have fixed the underlying bug and are taking the following actions to prevent recurrence:

- We immediately added additional alerting, which will notify us of similar issues significantly faster going forward.
- We are adding safeguards to prevent similar issues from occurring in the future. These safeguards provide strengthened automated correctness-checking of configurations before they are applied; a sketch of one such check follows this list.
- We are accelerating planned architectural changes that will improve how we isolate and resolve such issues in the future.
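As a sketch of what stronger correctness-checking before apply could look like (an illustrative pattern under assumed names, not the actual safeguard being built): seal the configuration with a digest at generation time and have the final consumer verify that digest before applying, so any alteration of the bytes later in the pipeline is rejected regardless of its form:

```go
// sealed_config.go: verify at the last hop that the bytes being applied are
// exactly the bytes that were generated and validated upstream.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// sealed pairs a configuration payload with the digest computed when it was
// generated, before it entered the distribution pipeline.
type sealed struct {
	payload []byte
	digest  string
}

func seal(payload []byte) sealed {
	sum := sha256.Sum256(payload)
	return sealed{payload: payload, digest: hex.EncodeToString(sum[:])}
}

// applyIfIntact is what the final consumer runs: it refuses any payload whose
// bytes no longer match the digest from generation time.
func applyIfIntact(s sealed) error {
	sum := sha256.Sum256(s.payload)
	if hex.EncodeToString(sum[:]) != s.digest {
		return fmt.Errorf("config altered in transit, refusing to apply")
	}
	fmt.Println("config intact, applying")
	return nil
}

func main() {
	s := seal([]byte(`{"host":"shop.example.com","backend":"svc-b","port":443}`))

	if err := applyIfIntact(s); err != nil { // intact: applies
		fmt.Println(err)
	}

	s.payload[10] ^= 0xFF // simulate corruption somewhere late in the pipeline
	if err := applyIfIntact(s); err != nil { // altered: refused
		fmt.Println(err)
	}
}
```

Unlike enumerated structural checks, a digest comparison does not need to anticipate the specific way a payload might be corrupted.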
