This past weekend we at 4Degrees completed a database transition from Microsoft Azure’s managed PostgreSQL service to a self-hosted PostgreSQL solution with Barman for backups.
We had adopted Azure’s solution early on in our company history because 1) it came with robust automated backup and recovery capabilities and 2) we were a part of Microsoft’s generous startup sponsorship package, so the additional cost of the service was covered by credits. But over the past year or so we’ve had a number of issues with the service; most notably, it was our single largest source of unplanned site downtime (it had a habit of triggering dozens of outage emails to us in the middle of the night).
Our Azure Postgres issues came to a head in the past month or two: as we’ve scaled, we’ve run into a number of performance issues. We needed to implement a trigram-based index to speed up our searches on text fields, but a configuration issue with Azure’s service prevented us from doing so. The impact to our customers, on top of the uptime issues, was enough to push us back onto our own service.
Architecture
[Architecture diagram taken from the Barman documentation]
There are a few different ways to set up PostgreSQL with Barman. We opted for continuously streaming backups with WALs (Write Ahead Logs), as this gave us the same RPO (Recovery Point Objective — the amount of information you lose in a recovery) of 0 that we had with Azure’s service. Missing from the diagram is the ad-hoc SSH connection back from the Barman server to the PostgreSQL server used during recoveries.
We run the streaming backup process every night, essentially creating a checkpoint of the database at that point in time, with a 7-day retention policy. The WAL streaming process runs continuously, retaining a record of all changes that have occurred since the last checkpoint.
Virtual Machines
For the PostgreSQL server, we opted for Azure’s DS3 v2 VM, with 4 vCPUs and 14GB of memory. This gave us slightly more performance than we had with the managed service. We included a 1.1TB managed disk; although we don’t need nearly that much storage today, the read/write speed of Azure’s managed disks goes up along with the disk size.
For the Barman server, we opted for a more basic D2s v3, with 2 vCPUs and 8GB of memory. In comparison to the DS series, the D series has less read/write performance (and is cheaper). We weren’t sure that the Barman server needed this much CPU or memory, but figured we didn’t want to run into any hanging issues while trying to execute a recovery. We chose a 250GB managed disk, though we may need to increase that depending on how much space the 7-day retention policy ends up using.
Both VMs were set up with Ubuntu 16.04.
Setting Up PostgreSQL
If you already know how to set up a PostgreSQL server, you can skip this section; the Barman setup instructions that follow are self-contained.
1/ Mount managed disk
We followed some steps to set up the 1.1TB disk at /data (a rough sketch of the approach is below). The official directions provided by Microsoft did not work for us, and based on some Googling they have not worked for others either.
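For reference, here is a minimal sketch of the general approach to formatting and mounting a data disk on Ubuntu 16.04. It assumes the new disk shows up as /dev/sdc (check with lsblk); the device name and the UUID placeholder are assumptions, not values from our setup.
# Identify the attached data disk (assumed here to be /dev/sdc)
$ lsblk
# Partition the disk and create an ext4 filesystem
$ sudo parted /dev/sdc --script mklabel gpt mkpart primary ext4 0% 100%
$ sudo mkfs.ext4 /dev/sdc1
# Mount the disk at /data
$ sudo mkdir -p /data
$ sudo mount /dev/sdc1 /data
# Make the mount persistent across reboots via /etc/fstab (use the UUID reported by blkid)
$ sudo blkid /dev/sdc1
$ echo 'UUID=<uuid-from-blkid> /data ext4 defaults,nofail 0 2' | sudo tee -a /etc/fstab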
2/ Install PostgreSQL
$ wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | apt-key add -
$ add-apt-repository "deb https://apt.postgresql.org/pub/repos/apt/ xenial-pgdg main"
$ apt-get update
$ apt-get install postgresql-9.6
3/ Configuration
- Copy the PostgreSQL data folder structure over to the storage location on the managed disk: $ rsync -av /var/lib/postgresql /data
- Open up the PostgreSQL configuration file at /etc/postgresql/9.6/main/postgresql.conf and make the following changes:
data_directory = '/data/postgresql/9.6/main'
listen_addresses = '*'
port = 5432
- Open up the PostgreSQL connection configuration file at /etc/postgresql/9.6/main/pg_hba.conf and change the IPv4 local connections to:
host all all <your CIDR IP range> md5
- Restart PostgreSQL with systemctl restart postgresql
- Make sure your data directory was switched correctly:
$ sudo -u postgres psql
=# SHOW data_directory;
4/ Create database
$ sudo -u postgres psql
=# create database <database_name>;
=# create user <user_name> with password '<password>';
=# grant all privileges on database "<database_name>" to <user_name>;
5/ Load data from remote
This step is only if you’re migrating a database, as we were. It took ~5–10 minutes per GB of data.
$ pg_dump -v -C -h <previous_db_host> -U <previous_db_user_name> <previous_db_name> | sudo -u postgres psql <new_db_name>
At this point, your PostgreSQL server should be set up. Feel free to test the connection / functionality however you’d like. The rest is dedicated to setting up Barman for automatic backup / recovery.
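For example, a quick sanity check from another machine in the allowed IP range might look like this (the placeholders are the same hypothetical values used above):
# Confirm remote connectivity and that the new role and database work
$ psql -h <database_host> -p 5432 -U <user_name> -d <database_name> -c 'SELECT version();'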
Setting Up Barman
1/ Mount managed disk
As with the PostgreSQL server, we followed some steps to set up the 250GB disk at /data (the same general approach sketched in the previous section); the official directions provided by Microsoft did not work for us here either.
2/ Install PostgreSQL
$ wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | apt-key add -
$ add-apt-repository "deb https://apt.postgresql.org/pub/repos/apt/ xenial-pgdg main"
$ apt-get update
$ apt-get install postgresql-9.6
3/ Install Barman
$ curl https://dl.2ndquadrant.com/default/release/get/deb | sudo bash
$ sudo apt-get update
$ sudo apt-get install barman
4/ Configure Barman
Open up /etc/barman.conf and change/confirm:
barman_user = barman
configuration_files_directory = /etc/barman.d
barman_home = /data/barman
log_file = /var/log/barman/barman.log
log_level = INFO
compression = gzip
retention_policy = RECOVERY WINDOW OF 1 WEEKS
wal_retention_policy = main
5/ Configure server connection
Create a new file at /etc/barman.d/pg.conf and add:
[pg]
description = "Streaming PostgreSQL backup"
conninfo = host=<database_host> user=barman dbname=<database_name>
streaming_conninfo = host=<database_host> user=barman
backup_method = postgres
streaming_archiver = on
slot_name = barman
6/ Set up storage directory
$ mkdir /data/barman
$ chown barman:barman /data/barman
7/ Create Barman user for PostgreSQL
- On the PostgreSQL server, run $ sudo -u postgres createuser -s -P barman. You’ll be prompted for a password. Make sure to write down whatever you enter.
- Back on the Barman server, create a new file at ~barman/.pgpass and add:
<database_host>:5432:*:barman:<barman_password>
.pgpass should have 0600 permissions, which you can set with chmod 0600 ~barman/.pgpass
- Test the psql connection:
$ sudo su barman
$ psql -c 'SELECT version()' -U barman -h <database_host> <database_name>
8/ Configure PostgreSQL for streaming (for backup)
- Head back to the PostgreSQL server and open /etc/postgresql/9.6/main/postgresql.conf, change/confirm:
wal_level = 'replica'
max_wal_senders = 2
max_replication_slots = 2
- Open /etc/postgresql/9.6/main/pg_hba.conf, add:
host replication barman <your CIDR IP range> md5
- Restart PostgreSQL: $ systemctl restart postgresql
- Back on the Barman server, test the replication connection:
$ sudo su barman
$ psql -U barman -h <database_host> -c 'IDENTIFY_SYSTEM' replication=1
9/ Set up SSH (for restores)
- On the Barman server, run:
$ sudo su barman
$ ssh-keygen -t rsa
- Save the key to ~/.ssh/pg
- Copy the value of the public key in ~/.ssh/pg.pub
- On the PostgreSQL server, open or create ~postgres/.ssh/authorized_keys and append the public key (the commands for this are sketched below)
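As a rough sketch of those last three bullets (with <contents of pg.pub> and <database_host> as placeholders), the copy-and-permissions steps look something like this:
# On the Barman server, print the public key so it can be copied
$ sudo su barman
$ cat ~/.ssh/pg.pub
# On the PostgreSQL server, append the key and set strict permissions
$ sudo -u postgres mkdir -p ~postgres/.ssh
$ echo '<contents of pg.pub>' | sudo -u postgres tee -a ~postgres/.ssh/authorized_keys
$ sudo chmod 700 ~postgres/.ssh
$ sudo chmod 600 ~postgres/.ssh/authorized_keys
# Back on the Barman server, confirm that key-based login works
$ ssh -i ~/.ssh/pg postgres@<database_host> 'true'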
10/ Create replication slot
- On the Barman server, run $ barman receive-wal --create-slot pg
11/ Test
- Test WAL streaming with $ barman switch-wal --force --archive pg
- Test everything with $ barman check pg
12/ Set up cron jobs
- As user barman, open up cron jobs with $ crontab -e and add:
* * * * * /usr/bin/barman cron
0 4 * * * /usr/bin/barman backup pg
- You can also manually create a backup at any time with $ barman backup pg. This took us about 30 seconds per GB of data.
Recovery With Barman
1/ Stop the database
- Stop the PostgreSQL service on the PostgreSQL server with $ systemctl stop postgresql
2/ Run the recovery
- Get a list of possible backup spots with $ barman list-backup pg
- Get details for the backup you choose with $ barman show-backup pg <backup_id>. Note that <backup_id> is the timestamp of the backup.
- Run $ barman recover --remote-ssh-command "ssh -i ~/.ssh/pg -o StrictHostKeyChecking=no postgres@<database_host>" --target-time="<YYYY-MM-DD HH:MM:SS.SSSSSS+00:00>" pg <backup_id> /data/postgresql/9.6/main. This took us about 30 seconds per GB of data.
3/ Restart the database
- Back on the PostgreSQL server, run $ systemctl start postgresql
Update: here are our post-implementation learnings from two months of running the system live.
- Stopping the database during recovery matters. When you’re looking to recover a backup onto your production DB, it’s critically important to stop the database first. We had documented this step originally, but didn’t fully understand its importance. If you don’t stop the DB, the recovery will still go through, but the database will be corrupted and you’ll have to run the recovery over again (an 18+ minute process for us).
- Automatic syncing breaks after a recovery. For some still-undetermined reason, our automated backup breaks after a recovery is performed. Our solution, for now, is to move the past Barman backups to an archive and start the backup process from scratch. It works, but certainly isn’t elegant.
- Monitoring is a must. Due to the previous issue (#2), we were without backups for a significant period of time (which we only discovered when we needed them). We’ve since been checking on the backups manually every day and recently discovered another problem when our disk ran out of storage space. Without monitoring, there’s no way to know whether your backups are actually occurring. We’re now looking into using Ansible for a daily monitor of backup health; a simple interim check is sketched below.
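As an interim measure, something like the following daily cron-driven script is a reasonable sketch. It assumes barman check exits with a non-zero status when any of its checks fail (worth verifying on your Barman version), that a working mail command (e.g. from mailutils) is installed, and that alerts@example.com is a placeholder address.
#!/bin/bash
# check_barman.sh (hypothetical path) - run daily from the barman user's crontab
# Emails the output of 'barman check pg' if any of its checks fail.
OUTPUT=$(/usr/bin/barman check pg 2>&1)
if [ $? -ne 0 ]; then
    echo "$OUTPUT" | mail -s "Barman check failed on $(hostname)" alerts@example.com
fi
Installed at, say, /usr/local/bin/check_barman.sh, it could be scheduled alongside the existing cron entries with a line like 0 6 * * * /usr/local/bin/check_barman.sh.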