Yesterday, Superhuman was down for nearly two hours due to a failure with our database. I am deeply sorry for this. We know that email is mission critical, and that this much downtime is unacceptable.
During the downtime, emails could not be sent and it was not possible to log in to Superhuman. If you were already logged in, you could still receive emails.
The failure was due to two simultaneous issues:
1. Our database was running low on disk space.
2. One of the availability zones that our database runs in was unable to provision more disk space.
For our database, we use Google Cloud SQL in High Availability mode. We also use the built-in feature to “automatically increase disk space”. We failed to realize two important things about this setup:
1. The automatic disk space increase is very conservative. Based on current load, it would only allocate enough space for a few additional hours at peak traffic.
2. Increasing disk space is an operation that requires both availability zones to be active.
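To make the first point concrete, here is a minimal sketch of how the remaining headroom can be estimated from free space and the rate of disk growth. The function name and the example figures are illustrative assumptions, not measurements from our system, and the estimate assumes growth stays constant.

```python
def hours_until_full(free_gb: float, growth_gb_per_hour: float) -> float:
    """Estimate hours until the disk fills, assuming a constant growth rate."""
    if free_gb < 0:
        raise ValueError("free_gb must not be negative")
    if growth_gb_per_hour <= 0:
        # The disk is not growing, so it never fills under this model.
        return float("inf")
    return free_gb / growth_gb_per_hour


# Hypothetical figures: 25 GB free, growing by 10 GB per hour at peak.
print(hours_until_full(25, 10))  # 2.5
```

If each automatic increase only buys a few hours like this, a single failed resize leaves very little time to react.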
We spoke with Google Cloud Support, who explained all of this in detail. We then decided to temporarily disable high availability so that we could resize the primary database.
This is the timeline of events:
09:40. The auto-scaler detected that we had less than 25 GB of free space and tried to increase capacity, but this failed.
12:03. Our database ran out of disk space.
12:03-12:11. We tried to manually increase disk space and to fail over to another zone, but both attempts failed.
12:11. We opened a ticket with Google Cloud Support.
12:59. We were on the phone with Google, who provided a detailed explanation of the issue.
13:30. We disabled high availability on our database, and resized it in the working zone.
13:34. Our database was back up again.
13:34-13:50. Clients began to reconnect and send email.
13:59. Normal operations resumed, though our database is not in high availability mode for the time being.
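The durations implied by this timeline can be checked with a short calculation. This sketch uses only the timestamps listed above; the helper name is an assumption for illustration, and it assumes all times fall on the same day.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Failed automatic increase until the disk was full.
print(minutes_between("09:40", "12:03"))  # 143
# Disk full until the database was back up.
print(minutes_between("12:03", "13:34"))  # 91
# Disk full until normal operations.
print(minutes_between("12:03", "13:59"))  # 116
```

The last figure, 116 minutes, is the "nearly two hours" of downtime. The first shows that there were more than two hours between the failed automatic increase and the outage.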
As a result of this incident, we are going to make several changes:
1. Tonight, we are going to re-enable high availability on our database. This will cause ~10m of downtime, but we will do it when we have our lowest traffic: 11:50 pm PST.
2. We have built our own database auto-scaler that will trigger much earlier than the built-in auto-scaler.
3. We have added alerting on database disk-utilization metrics so that we can pre-empt any similar failures.
4. We will fix the client so that it does not log you out if the backend is unexpectedly down, so you can continue to read and process email.
5. We are going to practice failing over to our secondary read replica. This will be helpful if we are ever again in a situation where both our primary-replica pairs are not functioning.
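As a rough illustration of changes 2 and 3, the sketch below shows one way an early-trigger resize and a disk-utilization alert could be decided from the same metric. It is not our production code: the thresholds, the growth factor, and all names are hypothetical, and the actual resize and alert calls are left out.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds, chosen so both fire well before the disk is full.
ALERT_UTILIZATION = 0.70   # alert a human at 70% used
RESIZE_UTILIZATION = 0.80  # request a resize at 80% used
GROWTH_FACTOR = 1.5        # grow the disk by 50% per resize


@dataclass
class DiskStatus:
    total_gb: float
    used_gb: float

    @property
    def utilization(self) -> float:
        """Fraction of the disk in use, between 0 and 1."""
        return self.used_gb / self.total_gb


@dataclass
class Plan:
    alert: bool
    resize_to_gb: Optional[float]


def plan_actions(status: DiskStatus) -> Plan:
    """Decide whether to alert and whether to request a larger disk."""
    alert = status.utilization >= ALERT_UTILIZATION
    resize_to_gb = None
    if status.utilization >= RESIZE_UTILIZATION:
        resize_to_gb = status.total_gb * GROWTH_FACTOR
    return Plan(alert=alert, resize_to_gb=resize_to_gb)


print(plan_actions(DiskStatus(total_gb=100, used_gb=85)))
```

Keeping the alert threshold below the resize threshold means a person is notified before any resize is attempted, so a failed resize does not go unnoticed.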
Again, I am truly sorry that this happened. These steps will ensure that we do not have a similar incident in the future.
If you have any questions, please just ask: firstname.lastname@example.org.