On Wednesday 08 April, 2020 beginning at 06:48 US/Pacific, Google Cloud Identity and Access Management (IAM) experienced significantly elevated error rates for a duration of 54 minutes. IAM is used by several Google services to manage user information, and 2/
To our Cloud customers whose businesses were impacted during this disruption, we sincerely apologize – we have conducted a thorough internal investigation and are 4/
Many Cloud services depend on a distributed Access Control List (ACL) in Cloud Identity and Access Management (IAM) for validating permissions, activating new 5/
The trigger of this incident was a rarely-exercised type of configuration change in 7/
[1] research.google/pubs/pub48190/
# REMEDIATION AND PREVENTION
Google engineers were automatically alerted to elevated error rates affecting Cloud IAM at 2020-04-08 06:52 US/Pacific and immediately began 10/
Google's standard production practice is to push any change gradually, in increments designed to maximize the probability of detecting problems before they have broad impact. Furthermore, we adhere to a 13/
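As an illustration of pushing a change gradually in increments, the following is a minimal sketch and not Google's actual rollout tooling. The function `staged_rollout`, its callback parameters, the stage fractions, and the 1% error threshold are all hypothetical names and values chosen for the example.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[float],
    apply_to_fraction: Callable[[float], None],
    error_rate: Callable[[], float],
    rollback: Callable[[], None],
    max_error_rate: float = 0.01,  # hypothetical threshold, not from the report
) -> bool:
    """Apply a change one stage at a time.

    After each stage the observed error rate is checked. If it exceeds
    the threshold, the change is rolled back and no later stage runs.
    Returns True only if every stage completed while healthy.
    """
    for fraction in stages:
        apply_to_fraction(fraction)
        if error_rate() > max_error_rate:
            rollback()
            return False
    return True


if __name__ == "__main__":
    # Simulated run: the second stage shows a 50% error rate.
    observed = iter([0.0, 0.5])
    done = staged_rollout(
        stages=[0.01, 0.10, 1.0],
        apply_to_fraction=lambda f: print(f"applied to {f:.0%} of fleet"),
        error_rate=lambda: next(observed),
        rollback=lambda: print("rolled back"),
    )
    print("completed:", done)
```

In this sketch the change never reaches the final stage, because the health check after the second increment fails and triggers the rollback.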
We truly understand how important regional reliability is for our users and deeply apologize for this incident.
# DETAILED DESCRIPTION OF IMPACT
On Wednesday 08 April, 2020 from 6:48 to 7:42 US/Pacific, Cloud IAM experienced an outage, which had varying 17/
Experienced a 100% error rate globally on all internal Cloud IAM API requests from 6:48 - 7:42. Upon the internal Cloud IAM service becoming unavailable (which impacted downstream Cloud 18/
## Gmail
Experienced delays receiving and sending emails from 6:50 to 7:39. For inbound emails, 20% of G Suite emails, 21% of G Suite customers, and 0.3% of consumer emails were affected. For outbound emails 21/
## Compute Engine
Experienced a 100% error rate when 22/
Experienced a 100% error rate when performing instance creation, deletion, backup, and failover operations globally for high-availability (HA) 23/
To prevent HA Cloud SQL instances from encountering these failures in the future, we will change the auto-failover system to avoid triggering based on IAM issues. We are also re-examining the auto-failover system more generally to make 27/
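The idea of an auto-failover system that does not trigger on IAM issues can be sketched as follows. This is an assumption-laden illustration, not the Cloud SQL implementation: `FailureCause`, `HealthCheckResult`, `NON_FAILOVER_CAUSES`, and `should_fail_over` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureCause(Enum):
    # Hypothetical failure categories for illustration only.
    INSTANCE_UNRESPONSIVE = "instance_unresponsive"
    ZONE_UNAVAILABLE = "zone_unavailable"
    IAM_UNAVAILABLE = "iam_unavailable"


@dataclass(frozen=True)
class HealthCheckResult:
    healthy: bool
    cause: Optional[FailureCause] = None


# Causes that reflect a dependency outage rather than a problem with the
# primary instance itself; failing over would not help in these cases.
NON_FAILOVER_CAUSES = frozenset({FailureCause.IAM_UNAVAILABLE})


def should_fail_over(result: HealthCheckResult) -> bool:
    """Return True only when the primary is unhealthy for a reason
    that a failover to the standby could actually fix."""
    if result.healthy:
        return False
    return result.cause not in NON_FAILOVER_CAUSES


if __name__ == "__main__":
    iam_outage = HealthCheckResult(False, FailureCause.IAM_UNAVAILABLE)
    zone_outage = HealthCheckResult(False, FailureCause.ZONE_UNAVAILABLE)
    print(should_fail_over(iam_outage))   # False
    print(should_fail_over(zone_outage))  # True
```

Keeping the excluded causes in a single set makes it straightforward to review which dependency failures are deliberately ignored by the failover decision.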
## Cloud Pub/Sub
Experienced 100% error rates globally for Topic administration operations (create, get, and list) from 6:48 - 7:42.
## Kubernetes Engine
Experienced a 100% error 28/
## BigQuery
Datasets.get and projects.getServiceAccount experienced nearly 100% failures globally from 6:48 - 7:42. Other dataset operations experienced elevated error rates up to 40% for the duration of the 29/
Additionally, BigQuery streaming in the US multi-region experienced 31/
## App Engine
Experienced a 100% error rate when creating, updating, or 32/
## Cloud Run
Experienced a 100% error rate when creating, updating, or deleting deployments globally from 6:48 to 7:42. Public services did not have HTTP serving 33/
## Cloud Functions
Experienced a 100% error rate when creating, updating, or deleting functions with access control [2] globally from 6:48 to 7:42. Public functions did not have HTTP serving affected.
[2] 34/
## Cloud Monitoring
Experienced intermittent errors when listing workspaces via the Cloud Monitoring UI from 6:42 - 7:42.
## Cloud Logging
Experienced average and peak error rates of 60% for ListLogEntries API 35/
## Cloud 36/
Experienced 100% error rates on several administrative operations including job creation, deletion, and autoscaling from 6:55 - 7:42.
## Cloud Dataproc
Experienced a 100% error rate when attempting to create clusters globally from 6:50 - 7:42.
## Cloud Data 37/
Experienced a 100% error rate for create instance operations globally from 6:48 - 7:42.
## Cloud Composer
Experienced 100% error rates when creating, updating, or deleting Cloud Composer environments globally between 6:48 - 7:42. Existing environments were 38/
## Cloud AI Platform Notebooks
Experienced elevated average error rates of 97.2% (peaking to 100%) from 6:52 - 7:48 in the following regions: asia-east1, asia-northeast1, asia-southeast1, australia-southeast1, europe-west1, northamerica-northeast1, 39/
## Cloud KMS
Experienced a 100% error rate for Create operations globally from 6:49 - 7:40.
## Cloud Tasks
Experienced an average error rate of 8% (up to 15%) for CreateTasks, and a 96% error rate for AddTasks in the 40/
Experienced 100% error rates for CreateJob and UpdateJob requests globally from 6:48 - 7:42.
## App Engine Task Queues
Experienced an average error rate of 18% (up to 25% at peak) for UpdateTask requests from 6:48 - 42/
## Cloud Build
Experienced no API errors; however, all builds submitted between 6:48 and 7:42 were queued until the issue was resolved.
## Cloud Deployment Manager
Experienced an elevated average error rate of 20%, peaking to 36% for operations globally 43/
Experienced a 100% error rate for API operations globally from 6:48 - 7:42.
## Firebase Real-time Database
Experienced elevated average error rates of 7% for REST API and long-polling requests (peaking to 10%) during the 44/
Experienced elevated average error rates of 85% (peaking to 100%) globally for Android tests running on virtual devices in Google Compute Engine instances. Impact lasted from 6:48 - 7:54 for a duration of 1 hour and 6 minutes.
## 45/
Experienced a 100% error rate when creating new versions globally from 6:48 - 7:42.
## Firebase Console
Experienced a 100% error rate for developer resources globally. Additionally, the Firedata API experienced an average error rate of 20% for API 46/
Affected customers experienced a range of issues related to the Firebase Console and API. API invocations returned empty lists of projects or HTTP 404 errors, and affected customers were unable to create, delete, update, or list many Firebase 47/
Experienced a 100% error rate when performing DeleteRegistry API calls from 6:48 - 7:42. Though DeleteRegistry API calls threw errors, the deletions issued did complete successfully.
## 49/
Experienced a 100% error rate for create, update, cancel, delete, and ListInstances operations on Redis instances globally from 6:48 - 7:42.
## Cloud Filestore
Experienced an average error rate of 70% for instance and snapshot creation, update, list, 50/
## Cloud Healthcare and Cloud Life Sciences
Experienced a 100% error rate for CreateDataset operations globally from 6:48 - 7:42.
# SLA CREDITS
If you believe your paid 51/
A full list of all Google Cloud Platform Service Level Agreements can be found at 52/
For G Suite, please request an SLA credit through one of the Support channels: support.google.com/a/answer/104721
G Suite Service Level Agreement can be found at gsuite.google.com/intl/en/terms/… 53x