In April, we experienced three incidents that resulted in significant impact and degraded availability for Codespaces and GitHub Packages.
April 1 7:07 UTC (lasting 5 hours 32 minutes)
Our alerting detected an increase in failures to create new Codespaces and to start existing stopped Codespaces in the western United States. We immediately updated the GitHub status page and began investigating.
Further investigation revealed that some of the secrets used by the Codespaces service had expired. Codespaces maintains warm pools of resources to shield users from intermittent failures in dependent services. Because the secrets had expired, the pools in the western United States ran dry. There was not enough early warning that the pools had crossed a low threshold, so there was no time to respond before capacity was exhausted. While we worked to mitigate the incident, pools in other regions also drained due to the expired secrets, and failures began occurring in those regions as well.
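To make the warm-pool pattern concrete, here is a minimal sketch (not GitHub's actual implementation; the class and watermark names are illustrative) of a pool that hands out pre-provisioned resources and reports when it runs low:

```python
import queue

class WarmPool:
    """Minimal warm-pool sketch: resources are provisioned ahead of time so
    requests can be served even if the backing provisioner briefly fails."""

    def __init__(self, provision, capacity, low_watermark):
        self.provision = provision        # callable that creates one resource
        self.capacity = capacity
        self.low_watermark = low_watermark
        self._pool = queue.SimpleQueue()

    def refill(self):
        """Background task: top the pool up to capacity. Provisioning errors
        (e.g. an expired secret) propagate here instead of being hidden."""
        while self._pool.qsize() < self.capacity:
            self._pool.put(self.provision())

    def acquire(self):
        """Hand out a pre-provisioned resource; flag a low pool early."""
        resource = self._pool.get_nowait()  # raises queue.Empty if exhausted
        if self._pool.qsize() < self.low_watermark:
            print("warning: pool below low watermark")
        return resource
```

The failure mode in this incident corresponds to `refill` silently failing for an extended period: `acquire` keeps succeeding until the queue is empty, which is why early low-watermark alerting matters.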
A limited number of GitHub engineers have access to rotate these secrets, and communication issues delayed the start of the rotation process. The expired secrets were eventually rotated and deployed to all regions, fully restoring the service.
To prevent this failure pattern in the future, we have put monitoring in place to check for expiring resources and to proactively alert us when pool resources are not being maintained. We've also added monitoring to notify us early when we approach resource exhaustion limits. In addition, we have begun migrating our services to a mechanism that does not depend on secret or credential rotation.
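As an illustration of the expiry monitoring described above (a hedged sketch, not GitHub's tooling; the secret names and warning window are assumptions), a check like this flags secrets well before they lapse:

```python
from datetime import datetime, timedelta, timezone

def expiring_soon(secrets, warn_window=timedelta(days=14), now=None):
    """Return names of secrets that are already expired or will expire
    within warn_window, so rotation can start before any outage."""
    now = now or datetime.now(timezone.utc)
    return [name for name, expires_at in sorted(secrets.items())
            if expires_at <= now + warn_window]

# Hypothetical inventory of secrets and their expiry timestamps.
secrets = {
    "codespaces-api": datetime(2022, 4, 1, tzinfo=timezone.utc),
    "registry-token": datetime(2023, 1, 1, tzinfo=timezone.utc),
}
```

Run periodically (e.g. from a scheduled job), `expiring_soon(secrets, now=datetime(2022, 3, 25, tzinfo=timezone.utc))` would flag `codespaces-api` a week before its expiry, leaving time for the limited set of engineers with rotation access to act.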
April 14 20:35 UTC (lasting 4 hours 53 minutes)
We are still investigating the factors and will provide a more detailed update in the May Availability Report, which will be published on the first Wednesday of June. We will also share our efforts to minimize the impact of future incidents.
April 25 8:59 UTC (lasting 5 hours 8 minutes)
During this incident, our alerting detected increased CPU utilization on one of the GitHub Packages registry databases, starting about an hour before customer impact occurred. The threshold for this alert was set relatively low and it was not a paging alert, so it was not investigated immediately. As CPU on the database continued to climb, the package registry began responding to requests with internal server errors, ultimately impacting customers. The increased activity was caused by an unexpectedly large volume of Create Manifest commands.
The throttling criteria configured at the database level were not sufficient to limit these commands, so all users of the GitHub Packages registry were impacted. Users were unable to push or pull packages and could not access the packages UI or the landing pages of repositories that list packages.
Investigation revealed a number of performance bugs associated with the Create Manifest command. To limit the impact and restore normal operation, we blocked the activity that caused the issue. We are actively following up by improving package rate limiting and fixing the performance issues we found. We've also adjusted the database alert thresholds and severity so that we are alerted earlier (rather than after customer impact) for unexpected issues.
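To sketch what per-user rate limiting of a command like Create Manifest can look like (an illustrative sliding-window limiter, not GitHub's implementation; the limit and window values are assumptions), one heavy caller is throttled before it can exhaust shared database capacity:

```python
from collections import defaultdict, deque
import time

class SlidingWindowLimiter:
    """Per-user sliding-window rate limiter: allow at most `limit` calls
    per `window` seconds, so one caller cannot starve the database."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self._calls = defaultdict(deque)  # user -> timestamps of recent calls

    def allow(self, user, now=None):
        now = time.monotonic() if now is None else now
        calls = self._calls[user]
        while calls and calls[0] <= now - self.window:
            calls.popleft()               # drop timestamps outside the window
        if len(calls) >= self.limit:
            return False                  # throttled: caller should retry later
        calls.append(now)
        return True
```

A throttled request would typically receive an HTTP 429 with a `Retry-After` hint, leaving the database free to serve the rest of the registry's traffic.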
During this incident, we also discovered that repository home pages depend heavily on the package infrastructure: if the package registry is down, the home page of any repository that lists packages fails to load. We mitigated this during the incident by manually detaching the package list from the repository home page. We are working on a fix that loosely couples the package list, so that if it fails to load, the home pages of repositories that list packages will still render.
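The loose-coupling idea can be sketched as follows (a minimal illustration, not GitHub's rendering code; the function names and page format are assumptions): the page treats the package list as optional, so a registry failure degrades one section instead of the whole page.

```python
def render_home_page(repo, fetch_packages):
    """Render a repository home page whose package list degrades gracefully:
    if the package registry call fails, the page still loads without it."""
    try:
        packages = fetch_packages(repo)
    except Exception:
        packages = None                   # registry down: omit the section
    lines = [f"# {repo}"]
    if packages is None:
        lines.append("(package list temporarily unavailable)")
    else:
        lines.extend(f"- {p}" for p in packages)
    return "\n".join(lines)
```

With this structure, the failing dependency is isolated behind a fallback rather than being on the critical path of the page load.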
We will continue to keep you updated on the progress and investments we're making to ensure the reliability of our services. Please follow our status page for real-time updates, and check out the GitHub Engineering Blog for more information on what we're working on.
https://github.blog/2022-05-04-github-availability-report-april-2022/ GitHub Availability Report: April 2022