Resolved: High Google Cloud infrastructure components incident: We are investigating an issue with elevated error rates across multiple Google Cloud Platform Services. status.cloud.google.com/incident/zall/…

# ISSUE SUMMARY
(All times in US/Pacific daylight time)

On Wednesday 08 April, 2020 beginning at 06:48 US/Pacific, Google Cloud Identity and Access Management (IAM) experienced significantly elevated error rates for a duration of 54 minutes. IAM is used by several Google services to manage user information, and the elevated IAM error rates resulted in degraded performance that extended beyond 54 minutes for the following Cloud services:

- Google BigQuery’s streaming service experienced degraded performance for 116 minutes;
- Cloud IAM’s external API returned elevated errors for 102 minutes;
- 3% of Cloud SQL HA instances were degraded for durations ranging from 54 to 192 minutes.

To our Cloud customers whose businesses were impacted during this disruption, we sincerely apologize. We have conducted a thorough internal investigation and are taking immediate action to improve the resiliency, performance, and availability of our service.

# ROOT CAUSE

Many Cloud services depend on a distributed Access Control List (ACL) in Cloud Identity and Access Management (IAM) for validating permissions, activating new APIs, or creating new Cloud resources. Cloud IAM in turn relies on a centralized and planet-scale system to manage and evaluate access control for data stored within Google, known as Zanzibar [1]. Cloud IAM consists of regional and global instances; regional instances are
isolated from each other and from the global instance for reliability. However, some specific IAM checks, such as checking an organizational policy, reference the global IAM instance.
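
As a purely illustrative aside, the split between regional and global evaluation can be sketched in a few lines of Python. Nothing below reflects Google's internal design: the class, the routing table, and the rule that only "org-policy" checks hit the global backend are assumptions invented for the example. It only shows why a check that must consult the global instance loses the isolation that regional checks enjoy.

```python
# Hypothetical sketch of routing an IAM check to a regional or global backend.
# All names and routing rules are illustrative assumptions, not Google's design.
from dataclasses import dataclass


@dataclass
class IamBackend:
    name: str
    healthy: bool = True

    def evaluate(self, principal: str, permission: str, resource: str) -> bool:
        if not self.healthy:
            raise RuntimeError(f"{self.name}: internal error")
        # A real evaluation would consult the ACL store (e.g. Zanzibar).
        return True


REGIONAL = {
    "us-central1": IamBackend("iam-us-central1"),
    "europe-west1": IamBackend("iam-europe-west1"),
}
GLOBAL = IamBackend("iam-global")

# Checks that reference organization-wide state go to the global instance,
# so an outage there affects every region at once.
GLOBAL_ONLY_CHECKS = {"org-policy"}


def check(check_type: str, region: str, principal: str,
          permission: str, resource: str) -> bool:
    backend = GLOBAL if check_type in GLOBAL_ONLY_CHECKS else REGIONAL[region]
    return backend.evaluate(principal, permission, resource)


if __name__ == "__main__":
    GLOBAL.healthy = False  # simulate the global-instance outage
    # A purely regional check still succeeds:
    print(check("resource-acl", "us-central1", "alice",
                "bigquery.tables.get", "projects/p/datasets/d"))
    # An org-policy check depends on the global instance and fails:
    try:
        check("org-policy", "us-central1", "alice",
              "bigquery.tables.get", "projects/p/datasets/d")
    except RuntimeError as err:
        print("global check failed:", err)
```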

The trigger of this incident was a rarely-exercised type of configuration change in Zanzibar which also impacted Cloud IAM. A typical change to this configuration mutates existing configuration namespaces and is gradually rolled out through a sequence of canary steps. However, in this case, a new configuration namespace was added, and a latent issue with our canarying system allowed this specific type of configuration change to propagate globally in a rapid manner. As the configuration was pushed to production, the global Cloud IAM service quickly began to experience internal errors. This caused downstream operations with a dependency on global Cloud IAM to fail.

[1] research.google/pubs/pub48190/

# REMEDIATION AND PREVENTION

Google engineers were automatically alerted to elevated error rates affecting Cloud IAM at 2020-04-08 06:52 US/Pacific and immediately began investigating. By 07:27, the engineering team responsible for managing Zanzibar identified the configuration change responsible for the issue, and swiftly reverted the change to mitigate. The mitigation finished propagating by 07:42, partially resolving the incident for a majority of internal services. Specific services such as the external Cloud IAM API, high-availability Cloud SQL, and Google BigQuery streaming took additional time to recover due to complications arising from the initial outage. Services with extended recovery timelines
are described in the “detailed description of impact” section below.

Google's standard production practice is to push any change gradually, in increments designed to maximize the probability of detecting problems before they have broad impact. Furthermore, we adhere to a philosophy of defence-in-depth: when problems occur, rapid mitigations (typically rollbacks) are used to restore service within service level objectives. In this outage, a combination of bugs resulted in these practices failing to be applied effectively. In addition to rolling back the configuration change responsible for this outage, we are fixing the issue with our canarying and release system that allowed this specific class of change to rapidly roll out globally; instead, such changes will in the future be subject to multiple layers of canarying, with automated rollback if problems are detected, and a progressive deployment over the course of multiple days. Both Cloud IAM and Zanzibar will enter a change freeze to prevent the possibility of further disruption to either service before these changes are implemented.
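
The rollout discipline described here (multiple canary layers, automated rollback, multi-day progression) can be illustrated with a minimal sketch. This is not Google's release system; the stage fractions, soak time, and error-rate threshold below are assumed values chosen only to make the control flow concrete.

```python
# Illustrative sketch of a staged configuration rollout with automated rollback.
# Stage sizes, soak time, and the error-rate threshold are assumed values.
import time
from typing import Callable

STAGES = [            # fraction of the fleet receiving the new config, in order
    ("canary", 0.01),
    ("early", 0.10),
    ("half", 0.50),
    ("full", 1.00),
]
ERROR_RATE_THRESHOLD = 0.01   # abort if more than 1% of checks fail
SOAK_SECONDS = 1              # in practice this would be hours or days


def rollout(apply_config: Callable[[float], None],
            observed_error_rate: Callable[[], float],
            rollback: Callable[[], None]) -> bool:
    """Push a config through canary stages; roll back on elevated errors."""
    for stage, fraction in STAGES:
        apply_config(fraction)
        time.sleep(SOAK_SECONDS)          # let the change soak before judging it
        rate = observed_error_rate()
        print(f"stage={stage} fraction={fraction:.0%} error_rate={rate:.2%}")
        if rate > ERROR_RATE_THRESHOLD:
            print(f"error budget exceeded at stage '{stage}', rolling back")
            rollback()
            return False
    return True


if __name__ == "__main__":
    # Toy monitoring: pretend errors appear once 10% of the fleet has the change.
    current = {"fraction": 0.0}
    rollout(lambda f: current.update(fraction=f),
            lambda: 0.05 if current["fraction"] >= 0.10 else 0.001,
            lambda: current.update(fraction=0.0))
```

The key property is that a bad change is judged against live error metrics at each stage and reverted automatically before it reaches the whole fleet.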

We truly understand how important regional reliability is for our users and deeply apologize for this incident.

# DETAILED DESCRIPTION OF IMPACT

On Wednesday 08 April, 2020 from 6:48 to 7:42 US/Pacific, Cloud IAM experienced an outage, which had varying degrees of impact on downstream services as described in detail below.

## Cloud IAM
Experienced a 100% error rate globally on all internal Cloud IAM API requests from 6:48 - 7:42. Upon the internal Cloud IAM service becoming unavailable (which impacted downstream Cloud services), the external Cloud IAM API also began returning HTTP 500 INTERNAL_ERROR codes. The rate and volume of incoming requests (due to aggressive retry policies) triggered the system’s Denial of Service (DoS) protection mechanism. The automatic DoS protection throttled the service, implementing a rate limit on incoming requests that resulted in query failures and a large volume of retry attempts. Upon the incident’s mitigation, the DoS protection was removed, but its removal took additional time to propagate across the fleet. It finished propagating by 8:30, returning the service to normal operation.
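
The retry amplification described above is the failure mode that capped exponential backoff with jitter is meant to dampen. The helper below is a generic client-side sketch, not the behaviour of any particular Google client library; the attempt count, base delay, and cap are assumed values.

```python
# Generic retry helper with capped exponential backoff and full jitter.
# Aggressive fixed-interval retries against a failing service amplify load
# and can trip server-side DoS protection, as happened in this incident.
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(fn: Callable[[], T],
                      max_attempts: int = 5,
                      base_delay: float = 0.5,
                      max_delay: float = 32.0) -> T:
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random time up to the capped exponential bound.
            bound = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, bound))
    raise RuntimeError("unreachable")


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky():
        calls["n"] += 1
        if calls["n"] < 3:
            raise RuntimeError("HTTP 500 INTERNAL_ERROR")
        return "ok"

    print(call_with_backoff(flaky))
```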

## Gmail
Experienced delays receiving and sending emails from 6:50 to 7:39. For inbound emails, 20% of G Suite emails, 21% of G Suite customers, and 0.3% of consumer emails were affected. For outbound emails (including Gmail-to-Gmail), 1.3% of G Suite emails and 0.3% of consumer emails were affected. Message delays varied, with the 50th percentile peaking at 3.7 seconds and the 90th percentile reaching up to 2,580 seconds.

## Compute Engine
Experienced a 100% error rate when performing firewall modifications or create, update, or delete instance operations globally from 6:48 to 7:42.

## Cloud SQL
Experienced a 100% error rate when performing instance creation, deletion, backup, and failover operations globally for high-availability (HA) instances from 6:48 - 7:42, due to the inability to authenticate VMs via the Cloud IAM service. Additionally, Cloud SQL experienced extended impact from this outage for 3% of HA instances. These instances initiated failover when upstream health metrics were not propagated due to the Cloud IAM issues: HA instances automatically failed over in an attempt to recover from what was believed to be failures occurring on the master instances. Upon failing over, these instances became stuck in a failed state, because the Cloud IAM outage prevented the master’s underlying data disk from being attached to the failover instance. These stuck instances required manual engineer intervention to bring them back online. Impact on affected instances ranged from 6:48 - 10:00, for a total duration of 3 hours and 12 minutes.

To prevent HA Cloud SQL instances from encountering these failures in the future, we will change the auto-failover system to avoid triggering based on IAM issues. We are also re-examining the auto-failover system more generally to make sure it can distinguish a real outage from a system-communications issue going forward.
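
One way to picture that distinction is a failover decision that considers why health data is missing, not just that it is missing. The function below is hypothetical (the signal names and threshold are assumptions, not Cloud SQL's actual logic); it simply declines to fail over when the reporting path, rather than the primary, appears unhealthy.

```python
# Hypothetical auto-failover decision: only fail over when the primary itself
# looks unhealthy, not when health metrics are merely missing because a
# dependency (e.g. the IAM control plane or metrics pipeline) has problems.
from dataclasses import dataclass
from typing import Optional


@dataclass
class HealthSignals:
    primary_heartbeat_ok: Optional[bool]   # None = no data received at all
    control_plane_ok: bool                 # health of the reporting/IAM path
    missing_for_seconds: float


FAILOVER_AFTER_SECONDS = 60.0


def should_fail_over(s: HealthSignals) -> bool:
    if s.primary_heartbeat_ok is True:
        return False                       # primary is demonstrably fine
    if s.primary_heartbeat_ok is False:
        return True                        # primary reports failure: fail over
    # No data at all: distinguish a real outage from a communications issue.
    if not s.control_plane_ok:
        return False                       # likely a reporting gap, hold off
    return s.missing_for_seconds > FAILOVER_AFTER_SECONDS


if __name__ == "__main__":
    # Metrics missing only because the control plane is down: do not fail over.
    print(should_fail_over(HealthSignals(None, control_plane_ok=False,
                                         missing_for_seconds=300)))   # False
    # Primary explicitly unhealthy: fail over.
    print(should_fail_over(HealthSignals(False, control_plane_ok=True,
                                         missing_for_seconds=0)))     # True
```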

## Cloud Pub/Sub
Experienced 100% error rates globally for Topic administration operations (create, get, and list) from 6:48 - 7:42.

## Kubernetes Engine
Experienced a 100% error rate for cluster creation requests globally from 6:49 - 7:42.

## BigQuery
Datasets.get and projects.getServiceAccount experienced nearly 100% failures globally from 6:48 - 7:42. Other dataset operations experienced elevated error rates of up to 40% for the duration of the incident. BigQuery streaming was also impacted in us-east1 for 6 minutes, us-east4 for 20 minutes, asia-east1 for 12 minutes, asia-east2 for 40 minutes, europe-north1 for 11 minutes, and the EU multi-region for 52 minutes. Most of the above regions experienced average error rates of up to 30%. The EU multi-region, US multi-region, and us-east2 regions specifically experienced higher error rates, reaching nearly 100% for the duration of their impact windows.

Additionally, BigQuery streaming in the US multi-region experienced issues coping with traffic volume once IAM recovered. BigQuery streaming in the US multi-region experienced a 55% error rate from 7:42 - 8:44, for a total impact duration of 1 hour and 56 minutes.

## App Engine
Experienced a 100% error rate when creating, updating, or deleting app deployments globally from 6:48 to 7:42. Public apps did not have HTTP serving affected.

## Cloud Run
Experienced a 100% error rate when creating, updating, or deleting deployments globally from 6:48 to 7:42. Public services did not have HTTP serving affected.

## Cloud Functions
Experienced a 100% error rate when creating, updating, or deleting functions with access control [2] globally from 6:48 to 7:42. Public functions did not have HTTP serving affected.

[2] cloud.google.com/functions/docs…

## Cloud Monitoring
Experienced intermittent errors when listing workspaces via the Cloud Monitoring UI from 6:42 - 7:42.

## Cloud Logging
Experienced average and peak error rates of 60% for ListLogEntries API
calls from 6:48 - 7:42. Affected customers received INTERNAL_ERRORs. Additionally, create, update, and delete sink calls experienced a nearly 100% error rate during the impact window. Log Ingestion and other Cloud Logging APIs were unaffected.

## Cloud Dataflow
Experienced 100% error rates on several administrative operations, including job creation, deletion, and autoscaling, from 6:55 - 7:42.

## Cloud Dataproc
Experienced a 100% error rate when attempting to create clusters globally from 6:50 - 7:42.

## Cloud Data Fusion
Experienced a 100% error rate for create instance operations globally from 6:48 - 7:42.

## Cloud Composer
Experienced 100% error rates when creating, updating, or deleting Cloud Composer environments globally between 6:48 - 7:42. Existing environments were
unaffected.

## Cloud AI Platform Notebooks
Experienced elevated average error rates of 97.2% (peaking at 100%) from 6:52 - 7:48 in the following regions: asia-east1, asia-northeast1, asia-southeast1, australia-southeast1, europe-west1, northamerica-northeast1,
us-central1, us-east1, us-east4, and us-west1.

## Cloud KMS
Experienced a 100% error rate for Create operations globally from 6:49 - 7:40.

## Cloud Tasks
Experienced an average error rate of 8% (up to 15%) for CreateTasks, and a 96% error rate for AddTasks, in the following regions: asia-northeast3, asia-south1, australia-southeast1, europe-west1, europe-west6, northamerica-northeast1, southamerica-east1, us-central1, us-east4, and us-west3. Delivery of existing tasks was unaffected, but downstream services may have experienced other issues as documented.

## Cloud Scheduler
Experienced 100% error rates for CreateJob and UpdateJob requests globally from 6:48 - 7:42.

## App Engine Task Queues
Experienced an average error rate of 18% (up to 25% at peak) for UpdateTask requests from 6:48 - 7:42.

## Cloud Build
Experienced no API errors; however, all builds submitted between 6:48 and 7:42 were queued until the issue was resolved.

## Cloud Deployment Manager
Experienced an elevated average error rate of 20%, peaking at 36%, for operations globally between 6:49 and 7:39.

## Data Catalogue
Experienced a 100% error rate for API operations globally from 6:48 - 7:42.

## Firebase Real-time Database
Experienced elevated average error rates of 7% for REST API and long-polling requests (peaking at 10%) during the incident window.

## Firebase Test Lab
Experienced elevated average error rates of 85% (peaking at 100%) globally for Android tests running on virtual devices in Google Compute Engine instances. Impact lasted from 6:48 - 7:54, for a duration of 1 hour and 6 minutes.

## Firebase Hosting
Experienced a 100% error rate when creating new versions globally from 6:48 - 7:42.

## Firebase Console
Experienced a 100% error rate for developer resources globally. Additionally, the Firedata API experienced an average error rate of 20% for API operations from 6:48 - 7:42.

Affected customers experienced a range of issues related to the Firebase Console and API. API invocations returned empty lists of projects or HTTP 404 errors, and affected customers were unable to create, delete, update, or list many Firebase entities, including apps (Android, iOS, and Web), hosting sites, Real-time Database instances, and Firebase-linked GCP buckets. Firebase developers were also unable to update billing settings, and Firebase Cloud Functions could not be deployed successfully. Some customers experienced quota exhaustion errors due to extensive retry attempts.

## Cloud IoT
Experienced a 100% error rate when performing DeleteRegistry API calls from 6:48 - 7:42. Though DeleteRegistry API calls threw errors, the deletions issued did complete successfully.

## Cloud Memorystore
Experienced a 100% error rate for create, update, cancel, delete, and ListInstances operations on Redis instances globally from 6:48 - 7:42.

## Cloud Filestore
Experienced an average error rate of 70% for instance and snapshot creation, update, list, and deletion operations, with a peak error rate of 92%, globally between 6:48 and 7:45.

## Cloud Healthcare and Cloud Life Sciences
Experienced a 100% error rate for CreateDataset operations globally from 6:48 - 7:42.

# SLA CREDITS
If you believe your paid application experienced an SLA violation as a result of this incident, please submit an SLA credit request: support.google.com/cloud/contact/…

A full list of all Google Cloud Platform Service Level Agreements can be found at
cloud.google.com/terms/sla/.

For G Suite, please request an SLA credit through one of the Support channels: support.google.com/a/answer/104721

The G Suite Service Level Agreement can be found at gsuite.google.com/intl/en/terms/…