There was some interesting news at the end of January 2017.
On January 31st, 2017, GitLab accidentally deleted their production database (git repositories were not affected, though).
What happened: for some reason, PostgreSQL replication started lagging. One of the GitLab employees tried to fix the problem by playing with different settings, but it did not help. Then, at some point, that employee decided to delete everything and rebuild the replica from scratch. He (or she) tried to delete the folder with the replica data, but mixed up the servers and removed the folder on the master (rm -rf was run on db1.cluster.gitlab.com instead of db2.cluster.gitlab.com).
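A mix-up like this is the kind of thing a small guard can catch before anything destructive runs. Below is a minimal sketch in Python, not something GitLab used: the function name `confirm_host` and its parameters are made up for illustration, and it assumes the machine's fully qualified name is a reliable way to tell the servers apart.

```python
import socket


def confirm_host(expected_host, actual_host=None):
    """Refuse to continue unless we are on the host we intend to modify.

    `actual_host` exists only so the check can be exercised without
    changing the real hostname; by default the machine's FQDN is used.
    """
    actual = actual_host if actual_host is not None else socket.getfqdn()
    if actual != expected_host:
        raise RuntimeError(
            f"refusing to continue: running on {actual!r}, "
            f"expected {expected_host!r}"
        )
    return True
```

A cleanup script would call `confirm_host("db2.cluster.gitlab.com")` as its first step, so that running it on the master by mistake stops with an error instead of deleting data.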
The incident might not have been so bad, but they then realised they had no working backups:
- LVM snapshots are by default only taken once every 24 hours. YP happened to run one manually about 6 hours prior to the outage
- Regular backups seem to also only be taken once per 24 hours, though YP has not yet been able to figure out where they are stored. According to JN these don’t appear to be working, producing files only a few bytes in size.
- SH: It looks like pg_dump may be failing because PostgreSQL 9.2 binaries are being run instead of 9.6 binaries. This happens because omnibus only uses Pg 9.6 if data/PG_VERSION is set to 9.6, but on workers this file does not exist. As a result it defaults to 9.2, failing silently. No SQL dumps were made as a result. Fog gem may have cleaned out older backups.
- Disk snapshots in Azure are enabled for the NFS server, but not for the DB servers.
- The synchronisation process removes webhooks once it has synchronised data to staging. Unless we can pull these from a regular backup from the past 24 hours they will be lost
- The replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented
- SH: We learned later the staging DB refresh works by taking a snapshot of the gitlab_replicator directory, prunes the replication configuration, and starts up a separate PostgreSQL server.
- Our backups to S3 apparently don’t work either: the bucket is empty
- We don’t have solid alerting/paging for when backups fails, we are seeing this in the dev host too now.
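Several of the points above (dump files only a few bytes in size, no alerting when backups fail) could be caught by a very simple periodic check. Here is a rough sketch in Python, not GitLab's actual tooling: the function name `check_backup` and the thresholds (1 KiB minimum size, 24 hours maximum age) are assumptions chosen for illustration, and a real check should also try to restore the dump.

```python
import os
import time


def check_backup(path, min_bytes=1024, max_age_hours=24.0, now=None):
    """Return a list of problems found with the backup file at `path`.

    An empty list means the file exists, is at least `min_bytes` long
    and was modified within the last `max_age_hours`. The thresholds
    are illustrative defaults, not recommended values.
    """
    if not os.path.isfile(path):
        return [f"{path}: backup file is missing"]

    problems = []
    info = os.stat(path)

    # A dump of only a few bytes usually means the dump tool failed silently.
    if info.st_size < min_bytes:
        problems.append(f"{path}: only {info.st_size} bytes")

    # A stale file means the backup job has stopped running.
    current = now if now is not None else time.time()
    age_hours = (current - info.st_mtime) / 3600
    if age_hours > max_age_hours:
        problems.append(f"{path}: last modified {age_hours:.1f} hours ago")

    return problems
```

Run from cron or a monitoring agent, a non-empty result would be the trigger for paging someone; how the alert is delivered is left out here on purpose.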
Here is the link to GitLab's incident report.
I guess that employee deserved a bonus for finding problems with backups. :)