The Quest for High Availability Application Part 3: Distributed Database Management with Patroni (+ Etcd)

Inigo Ramli
Jun 25, 2021

In part 2, we saw one of the ways to replicate a PostgreSQL server. Using streaming replication, we can create multiple replicas of a single server that stay consistent at all times (at least until there is a disaster). These replicas remain silent (or read-only) in a "warm" state, ready to take over the role of the primary server in case of disaster. In theory, this makes our server highly available. Unfortunately, this management process requires a lot of manual effort. While Postgres allows replicas to mimic a primary server, it does not monitor the status of servers nor perform a failover should one go down. This means that in case of disaster, our ability to remain online depends on how quickly we can perform database maintenance, i.e. how quickly we can tell a replica to take over and update the primary server information on all replicas. Luckily, this process can be automated using other tools.

In this extra part, we will look into one of the tools that automate database failover. What we are looking for is a High Availability framework for PostgreSQL. Several tools fit this description, such as PostgreSQL Automatic Failover (PAF), repmgr, pg_auto_failover, and Patroni. The last one is the tool we will be delving into in this article.

Patroni is a High Availability Framework created by Zalando, which is interestingly not a tech company, but rather an online fashion platform. It is built specifically for PostgreSQL and can be used with various distributed configuration stores such as ZooKeeper, Consul, Kubernetes, and Etcd. We will look into these shortly. Patroni takes in a number of servers and manages the cluster creation, replication, recovery, and leader election. It also provides an API for other services to determine which servers are healthy and can be used.

Before we can look into configuring Patroni, we need to talk about Distributed Configuration Stores (DCS). What are those? Ever since the emergence of cloud computing and microservices, the problem of keeping information in sync between containers has become pressing. There are dynamic configurations that need to be kept in sync in a distributed architecture, like the IPs of nodes in a swarm, the hostname of the primary server in a cluster, and the health of other nodes in a K8s network. This kind of information can change frequently, and nodes should be able to access it consistently (no two nodes should see different values at the same time) at all times (even if one or more nodes experience a failure).

Simply centralizing these values on a single machine does not solve the problem, since our system then breaks down if that one node fails. Conversely, creating multiple nodes that store the same information without any sort of consistency check threatens the integrity of that information. We can see that this problem is central to the goal of Patroni and other High Availability frameworks: information like the current primary server's IP needs to be stored consistently. How well these tools perform depends on how well this synchronization problem is solved.

This is where a DCS comes in. Tools like Etcd provide a strongly consistent data store that can tolerate failures up to a certain degree. Etcd, in particular, does this by implementing an election system similar to Docker Swarm manager nodes: the cluster keeps working as long as more than half of its nodes are online.

Etcd runs as a cluster. A cluster consists of several nodes; ideally, one node corresponds to one VM or bare-metal machine, and each node should be able to reach the others over the network. A cluster of N nodes can tolerate the failure of floor((N-1)/2) of them, i.e. it keeps working as long as a majority of nodes is still up. This means that a 3-node etcd cluster can still run if one node is down (and a 5-node cluster tolerates two failures). Etcd stores information as a set of key-value pairs. A cluster has one leader at all times; the leader manages requests from and to clients. Leaders are chosen through an election, and when a leader fails, another is elected by the remaining nodes.

We shall demonstrate this capability. Assume that three VMs are online, all located in the same subnet so that each node can reach the others through the VPC. We will be using Debian 10 as usual.

Apt does not ship a suitable etcd package at the moment, so we will have to install it manually. Below are the commands needed to download the release binaries from the etcd GitHub repository and install them on our machine.

$ ETCD_VER=v3.4.16
$ ETCD_BIN=/usr/local/bin
$ GITHUB_URL=https://github.com/etcd-io/etcd/releases/download
$ DOWNLOAD_URL=${GITHUB_URL}
$ mkdir -p $ETCD_BIN
$ curl -L ${DOWNLOAD_URL}/${ETCD_VER}/etcd-${ETCD_VER}-linux-amd64.tar.gz -o /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz
$ tar xzvf /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz -C $ETCD_BIN --strip-components=1

Run the commands above on all of our instances. Then, execute etcd by running the command below on each machine. Don't forget to replace $THIS_IP with the private IP of the node the command is run on.

etcd --name $HOSTNAME --data-dir /var/lib/etcd/data \
--initial-advertise-peer-urls http://$THIS_IP:2380 \
--wal-dir /var/lib/etcd/wal \
--listen-peer-urls http://$THIS_IP:2380 \
--listen-client-urls http://$THIS_IP:2379,http://127.0.0.1:2379 \
--advertise-client-urls http://$THIS_IP:2379 \
--initial-cluster-token etcd-cluster-1 \
--initial-cluster etcd1=http://10.128.0.20:2380,etcd2=http://10.128.0.21:2380,etcd3=http://10.128.0.22:2380 \
--initial-cluster-state new

There are several important properties here:

  • data-dir specifies the location of our stored data. Each node keeps its own copy of the key-value store, regardless of whether it is the current leader or not.
  • wal-dir specifies the location of the Write-Ahead Log files, used for recovery after a disaster.
  • listen-peer-urls is the URL this node listens on for traffic from the other members of the cluster.
  • listen-client-urls is the URL this node listens on for requests from outside clients.
  • initial-cluster-token defines a token that identifies this cluster.
  • initial-cluster defines the name and peer URL of every node in our cluster. Since we already know the IPs of all nodes and they are not going to change for a long time, we use this to statically create our cluster. This parameter is written in the form name-1=peer-url-1,name-2=peer-url-2,name-3=peer-url-3,... . An example of the per-node setup is shown right after this list.
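
For example, on etcd1 (using the private IPs from the --initial-cluster line above; adjust per node) the variable could be set like this before launching etcd:

THIS_IP=10.128.0.20

Note that --name $HOSTNAME assumes each VM's hostname matches the member name used in --initial-cluster (etcd1, etcd2, and etcd3); if it does not, replace $HOSTNAME with the appropriate name.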

Open a separate terminal for each of the nodes. Verify that the nodes are running as one cluster by running etcdctl member list. The output below lists all three of our VMs:

etcd-1:~$ etcdctl member list
2f673b29c2322a99, started, etcd2, http://10.128.0.21:2380, http://10.128.0.21:2379, false
d9135718ef4e4083, started, etcd3, http://10.128.0.22:2380, http://10.128.0.22:2379, false
e52d165b45448af4, started, etcd1, http://10.128.0.20:2380, http://10.128.0.20:2379, false

Let’s put a value in our cluster by running the command below and confirm that the value has been stored in our instance:

etcd-1:~$ etcdctl put foo "Hello, etcd!"
OK
etcd-1:~$ etcdctl get foo
foo
Hello, etcd!

If we run the same get command on other nodes in the cluster, we can see that the value is the same. Our key can be accessed anywhere consistently!

etcd-2:~$ etcdctl get foo
foo
Hello, etcd!

Now, let’s put one of our VMs down, preferably the node that we used to create the ‘foo’ key.

Perform another ‘get’ operation on the remaining nodes and we can see that our key is still accessible.

etcd-3:~$ etcdctl get foo
foo
Hello, etcd!
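
We can also ask the surviving members about the cluster state directly. Assuming the node we shut down was etcd1 (10.128.0.20), pointing etcdctl at the remaining client URLs shows which node currently holds leadership:

etcd-3:~$ etcdctl --endpoints=http://10.128.0.21:2379,http://10.128.0.22:2379 endpoint status -w table

The IS LEADER column of the resulting table tells us which of the two remaining nodes was elected leader after the failure.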

Now that we have seen what etcd can do, let’s build a real setup for Patroni. We will create a daemon that runs etcd automatically during startup. This removes the burden of restarting etcd after a disaster from our shoulders.

First, create a dedicated system user and group that will run our etcd service.

sudo adduser --system --home /var/lib/etcd --group etcd
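
If the earlier manual run of etcd already created /var/lib/etcd/data and /var/lib/etcd/wal as root, the new etcd user will not be able to write to them. A quick fix (an assumption about your setup; skip it if the directories are already owned by etcd) is to hand the whole directory over to that user:

sudo mkdir -p /var/lib/etcd/data /var/lib/etcd/wal
sudo chown -R etcd:etcd /var/lib/etcd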

Then, define the configuration for our new etcd service.

# On VM etcd1
sudo tee /etc/systemd/system/etcd.service << EOF
[Unit]
Description=etcd key-value store
Documentation=https://github.com/coreos/etcd
After=network.target
[Service]
User=etcd
Group=etcd
Type=notify
ExecStart=/bin/bash --login -c "etcd --name etcd1 --initial
-advertise-peer-urls http://10.128.0.20:2380 \
--data-dir /var/lib/etcd/data \
--wal-dir /var/lib/etcd/wal \
--listen-peer-urls http://10.128.0.20:2380 \
--listen-client-urls http://10.128.0.20:2379,http://127.0.0.1:2379 \
--advertise-client-urls http://10.128.0.20:2379 \
--initial-cluster-token etcd-cluster-1 \
--initial-cluster etcd1=http://10.128.0.20:2380,etcd2=http://10.128.0.21:2380,etcd3=http://10.128.0.22:2380 \
--initial-cluster-state new"
Restart=always
RestartSec=10s
LimitNOFILE=40000
[Install]
WantedBy=multi-user.target
EOF

# On VM etcd2
sudo tee /etc/systemd/system/etcd.service << EOF
[Unit]
Description=etcd key-value store
Documentation=https://github.com/coreos/etcd
After=network.target
[Service]
User=etcd
Group=etcd
Type=notify
ExecStart=/bin/bash --login -c "etcd --name etcd2 --initial
-advertise-peer-urls http://10.128.0.21:2380 \
--data-dir /var/lib/etcd/data \
--wal-dir /var/lib/etcd/wal \
--listen-peer-urls http://10.128.0.21:2380 \
--listen-client-urls http://10.128.0.21:2379,http://127.0.0.1:2379 \
--advertise-client-urls http://10.128.0.21:2379 \
--initial-cluster-token etcd-cluster-1 \
--initial-cluster etcd1=http://10.128.0.20:2380,etcd2=http://10.128.0.21:2380,etcd3=http://10.128.0.22:2380 \
--initial-cluster-state new"
Restart=always
RestartSec=10s
LimitNOFILE=40000
[Install]
WantedBy=multi-user.target
EOF
# On VM etcd3
sudo tee /etc/systemd/system/etcd.service << EOF
[Unit]
Description=etcd key-value store
Documentation=https://github.com/coreos/etcd
After=network.target
[Service]
User=etcd
Group=etcd
Type=notify
ExecStart=/bin/bash --login -c "etcd --name etcd3 --initial
-advertise-peer-urls http://10.128.0.22:2380 \
--data-dir /var/lib/etcd/data \
--wal-dir /var/lib/etcd/wal \
--listen-peer-urls http://10.128.0.22:2380 \
--listen-client-urls http://10.128.0.22:2379,http://127.0.0.1:2379 \
--advertise-client-urls http://10.128.0.22:2379 \
--initial-cluster-token etcd-cluster-1 \
--initial-cluster etcd1=http://10.128.0.20:2380,etcd2=http://10.128.0.21:2380,etcd3=http://10.128.0.22:2380 \
--initial-cluster-state new"
Restart=always
RestartSec=10s
LimitNOFILE=40000
[Install]
WantedBy=multi-user.target
EOF

Lastly, enable and start our service (on all VMs).

sudo systemctl daemon-reload
sudo systemctl enable etcd
sudo systemctl start etcd
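
To confirm that the daemon came up correctly on each VM, we can check the service status and the cluster membership again, for example:

sudo systemctl status etcd
etcdctl member list

All three members should report as started, just like in the manual test earlier.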

Now that our etcd system is set, let’s start configuring Patroni.

Patroni is a Python application, so it requires Python 3 to run, along with several Python dependencies. According to one of the repo maintainers, it is better to install the distribution-native packages of these dependencies, so that is what we will do. We will reuse the etcd instances created before.

Install the following dependencies on all VMs:

sudo apt -y install python3-pip
sudo apt -y install python3-psycopg2
pip3 install --upgrade setuptools
pip3 install psycopg2-binary==2.8.3

It is more convenient to run Patroni as a shell command. For this to work, the newly installed module must be located somewhere visible from our PATH environment variable. To achieve this, run the installation as superuser:

sudo -E pip3 install patroni[etcd3]

Then, make sure /usr/local/bin (or wherever pip placed the Patroni executable) is part of the PATH environment variable.

etcd-1:~$ echo $PATH
...:/usr/local/bin:/usr/bin:/bin:...
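
A quick way to confirm that the shell can now find Patroni is to locate the executable and ask for its version (the exact path may differ depending on where pip installed it):

etcd-1:~$ which patroni
etcd-1:~$ patroni --version

These should print a path under /usr/local/bin and the installed version, respectively.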

As a PostgreSQL database manager, Patroni obviously needs PostgreSQL to be installed in our VMs. We can do this using apt, but there are caveats.

etcd-1:~$ sudo apt install postgresql postgresql-contrib
.
.
.
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok
Success. You can now start the database server using:

    pg_ctlcluster 11 main start

Ver Cluster Port Status Owner Data directory Log file
11 main 5432 down postgres /var/lib/postgresql/11/main /var/log/postgresql/postgresql-11-main.log
update-alternatives: using /usr/share/postgresql/11/man/man1/postmaster.1.gz to provide /usr/share/man/man1/postmaster.1.gz (postmaster.1.gz) in auto mode
Setting up postgresql-contrib (11+200+deb10u4) ...
Setting up postgresql (11+200+deb10u4) ...
Processing triggers for systemd (241-7~deb10u7) ...
Processing triggers for man-db (2.8.5-2) ....
.
.
.
etcd-1:~$ pg_lsclusters
Ver Cluster Port Status Owner Data directory Log file
11 main 5432 online postgres /var/lib/postgresql/11/main /var/log/postgresql/postgresql-11-main.log

As we can see from the output above, PostgreSQL creates a default ‘main’ cluster after installation. This is not what we want: Patroni works best when we leave the PostgreSQL packages alone and let Patroni manage the database for us, and a pre-existing Postgres cluster is easily confused with the cluster Patroni is about to create. Remove the default cluster.

etcd-1:~$ sudo pg_dropcluster 11 main --stop
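
Running pg_lsclusters again should now show no clusters, confirming that nothing Debian-managed is left running on the default ports:

etcd-1:~$ pg_lsclusters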

Create a dedicated user for Patroni. The home directory of this user will be used to store all files generated by Patroni, including the PostgreSQL data directory.

sudo adduser --system --home /var/lib/patroni --group patroni

Just like etcd, we should automate the lifecycle of Patroni by making it a system service.

sudo tee /etc/systemd/system/patroni.service << EOF
[Unit]
Description=Runners to orchestrate a high-availability PostgreSQL
After=syslog.target network.target
[Service]
Type=simple
User=patroni
Group=patroni
Environment="PATH=/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/usr/lib/postgresql/11/bin"
ExecStart=patroni /etc/patroni.yml
KillMode=process
TimeoutSec=30
Restart=no
[Install]
WantedBy=multi-user.target
EOF

Take note of the Environment line containing /usr/lib/postgresql/11/bin. This is the path to our PostgreSQL executables, such as pg_ctl, initdb, and pg_controldata. Depending on the PostgreSQL version we use, this path may need to be adjusted.
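
If unsure which path to use, listing the versioned binary directories installed by the Debian packages is a quick check (this assumes the standard Debian 10 package layout):

etcd-1:~$ ls -d /usr/lib/postgresql/*/bin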

sudo tee /etc/patroni.yml << EOF
scope: postgres
name: $PATRONI_NAME

restapi:
  listen: 0.0.0.0:8008
  connect_address: $THIS_IP:8008

etcd3:
  hosts: 10.128.0.20:2379,10.128.0.21:2379,10.128.0.22:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true
  initdb:
  - encoding: UTF8
  - data-checksums
  pg_hba:
  - host replication patroni_replicator 10.128.0.20/32 md5
  - host replication patroni_replicator 10.128.0.21/32 md5
  - host replication patroni_replicator 10.128.0.22/32 md5
  - host all all 0.0.0.0/0 md5
  users:
    admin:
      password: admin
      options:
        - createrole
        - createdb

postgresql:
  listen: 0.0.0.0:5434
  connect_address: $THIS_IP:5434
  data_dir: /var/lib/patroni/data
  pgpass: /var/lib/patroni/pgpass
  authentication:
    replication:
      username: patroni_replicator
      password: patroni_replicator_pass
    superuser:
      username: patroni_super
      password: patroni_super_pass
  parameters:
    unix_socket_directories: '.'

tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false
EOF

The configuration above covers a lot of aspects of Patroni, so let us look at it from top to bottom.

The first is restapi. This is the aforementioned API used to obtain information about the nodes in a Patroni cluster; for example, we can determine whether a node is the leader by querying its REST API. The API conventionally listens on port 8008, so we will keep it at that.
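
For example, once Patroni is up (a few steps below), querying the /master endpoint of a node should return HTTP 200 on the current leader and 503 on replicas. A quick sketch, using one of our node IPs from earlier:

curl -s -o /dev/null -w "%{http_code}\n" http://10.128.0.20:8008/master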

Then the etcd3 configuration. As stated earlier, Patroni requires a DCS to run. To integrate etcd v3 into our Patroni cluster, we need to specify the client URLs of the etcd hosts. Here we list the client URLs of all nodes, so our nodes have an alternative in case one of the etcd hosts fails.

Next, the bootstrap configuration. Bootstrapping is what Patroni does when the database does not exist yet: by default, if the specified data directory is missing, Patroni creates a new cluster for us using initdb, and the method can be changed here. Bootstrapping also initializes the cluster-wide settings stored in the DCS, specified under the dcs attribute. Another attribute is pg_hba, which describes the pg_hba.conf entries deciding which users may connect for which purpose; this file is managed by Patroni and can be customized here.

The postgresql attribute defines everything about the soon-to-be-created PostgreSQL cluster. Most of the values here are self-explanatory, such as data_dir, which points to our data directory, and authentication, which describes the superuser and replication user just like the ones we used in part 2. One value that stands out is the port. In my experience, ports 5432 and 5433 are sometimes still occupied even after we delete the default PostgreSQL cluster, so it is safer to use a new port than to risk Patroni colliding with a port that is already taken.
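
If in doubt, we can list which TCP ports are already being listened on before settling on one (ss ships with Debian 10):

etcd-1:~$ ss -ltn | grep 543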

Lastly, there is the tags attribute. It contains other miscellaneous properties regarding our cluster.

Finally, enable and start the Patroni service (on all VMs):

sudo systemctl daemon-reload
sudo systemctl enable patroni
sudo systemctl start patroni

Our Patroni cluster should be up now. If it isn’t, there are several log streams that may help with troubleshooting, such as journalctl -u patroni, the PostgreSQL log files, and the etcd logs.
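
For instance, the following commands show the most recent log entries of both services on a node (journalctl is standard on Debian):

sudo journalctl -u patroni -e
sudo journalctl -u etcd -e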

Check the member list of our cluster by running patronictl -c /etc/patroni.yml list .

app-1:~$ patronictl -c /etc/patroni.yml list
+ Cluster: postgres (6977438105258196111) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+--------+------------------+---------+---------+----+-----------+
| .... | 10.128.0.20:5434 | Replica | running | 2 | 0 |
| .... | 10.128.0.21:5434 | Leader | running | 2 | |
| .... | 10.128.0.22:5434 | Replica | running | 2 | 0 |
+--------+------------------+---------+---------+----+-----------+

Here we can see the list of all members along with their role, state, timeline (TL), and replication lag behind the leader in megabytes.

Let’s see if all replicas are synchronized. Connect to the leader of the cluster using psql and create a new schema and table.

etcd-2:~$ psql -h 10.128.0.21 -p 5434 -U patroni_super -d postgres
Password for user patroni_super:
psql (11.12 (Debian 11.12-0+deb10u1))
Type "help" for help.
postgres=# CREATE SCHEMA patron;
CREATE SCHEMA
postgres=# SET search_path=patron;
SET
postgres=# CREATE TABLE patron(id int);
CREATE TABLE
postgres=# SELECT * FROM PATRON;
id
----
(0 rows)
postgres=#

Connect to our replicas via psql, and confirm that our table has also been replicated:

etcd-1:~$ psql -h 10.128.0.20 -p 5434 -U patroni_super -d postgres
Password for user patroni_super:
psql (11.12 (Debian 11.12-0+deb10u1))
Type "help" for help.
postgres=# SELECT * FROM PATRON.PATRON;
id
----
(0 rows)

Our database system now has streaming replication, just like the setup we built in part 2. Let’s simulate a disaster by shutting down the leader.
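
One way to do this (our assumption; powering the VM off, as we did with etcd earlier, works just as well) is to stop the Patroni service on the leader node:

etcd-2:~$ sudo systemctl stop patroni

Checking the member list from one of the surviving nodes then shows the following: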

etcd-1:~$ patronictl -c /etc/patroni.yml list
+ Cluster: postgres (6977438105258196111) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+--------+------------------+---------+---------+----+-----------+
| etcd1 | 10.128.0.20:5434 | Replica | running | 2 | 0 |
| etcd2 | 10.128.0.21:5434 | Replica | stopped | | unknown |
| etcd3 | 10.128.0.22:5434 | Leader | running | 3 | |
+--------+------------------+---------+---------+----+-----------+

The automatic failover system provided by Patroni has kicked in! Our service remains available, and transactions can keep running against the new leader. Connect to the new leader using psql and insert a new value into our table.

etcd-3:~$ psql -h 10.128.0.22 -p 5434 -U patroni_super -d postgres
Password for user patroni_super:
psql (11.12 (Debian 11.12-0+deb10u1))
Type "help" for help.
postgres=# INSERT INTO PATRON.PATRON(ID) VALUES (5);
INSERT 0 1
postgres=# SELECT * FROM PATRON.PATRON;
id
----
5
(1 row)
postgres=#

We can do another sync check with the other replica:

etcd-1:~$ psql -h 10.128.0.20 -p 5434 -U patroni_super -d postgres
Password for user patroni_super:
psql (11.12 (Debian 11.12-0+deb10u1))
Type "help" for help.
postgres=# SELECT * FROM patron.patron;
id
----
5
(1 row)
postgres=#

Bring the downed node back up, and check the member list once again. Most likely, our leader will not change: the recovered node simply rejoins as a replica, and a new election only takes place when the current leader goes down.

etcd-2:~$ patronictl -c /etc/patroni.yml list
+ Cluster: postgres (6977438105258196111) ------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+--------+------------------+---------+---------+----+-----------+
| etcd1 | 10.128.0.20:5434 | Replica | running | 3 | 0 |
| etcd2 | 10.128.0.21:5434 | Replica | running | 3 | 0 |
| etcd3 | 10.128.0.22:5434 | Leader | running | 3 | |
+--------+------------------+---------+---------+----+-----------+
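
If we do want to move leadership back to a specific node, we do not have to wait for another failure: patronictl provides switchover (and failover) commands that trigger a controlled leader change, for example:

patronictl -c /etc/patroni.yml switchover

The command interactively asks which member should become the new leader before performing the switch.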

Check that the downed node has recovered and contains the latest database value:

etcd-2:~$ psql -h 10.128.0.21 -p 5434 -U patroni_super -d postgres
Password for user patroni_super:
psql (11.12 (Debian 11.12-0+deb10u1))
Type "help" for help.
postgres=# SELECT * FROM patron.patron;
id
----
5
(1 row)
postgres=#

Even though this experiment is quite basic, it demonstrates Patroni’s ability to manage the cluster and recover from disasters. This is major progress towards finalizing our database system!

With the power of Patroni, we have created a database system that automatically elects a new leader whenever disaster strikes. However, this leaves one small problem: how do clients know which database node is the current leader? Streaming replication does not allow replicas to handle write operations, so we need a way to route writes only to the primary server, whichever node that is at the time. To do this, we can put a proxy between clients and the system using a load balancer. The proxy performs a health check on each database node to determine which one is the leader and sends the client there. HAProxy is a good match for this use case.

Create a separate VM named etcd-lb with the same configuration as the other etcd instances (Debian 10, same subnet).

Install HAProxy on it (sudo apt install haproxy), then create a file haproxy.cfg (on Debian, typically /etc/haproxy/haproxy.cfg) with the following content:

listen postgres_write
    bind *:5432
    mode tcp
    option httpchk
    http-check expect status 200
    default-server inter 10s fall 3 rise 3 on-marked-down shutdown-sessions
    server postgresql_1 10.128.0.20:5434 check port 8008
    server postgresql_2 10.128.0.21:5434 check port 8008
    server postgresql_3 10.128.0.22:5434 check port 8008

listen postgres_read
    bind *:5433
    mode tcp
    balance leastconn
    option pgsql-check user admin
    default-server inter 10s fall 3 rise 3 on-marked-down shutdown-sessions
    server postgresql_1 10.128.0.20:5434 check
    server postgresql_2 10.128.0.21:5434 check
    server postgresql_3 10.128.0.22:5434 check

Our proxy performs two kinds of routing:

  • If a client wants to perform write (and read) operations, it should connect to port 5432 of our proxy. The proxy forwards the connection only to the server whose Patroni REST API on port 8008 returns the expected status, i.e. the current leader.
  • If a client only needs read operations, it connects to port 5433 and is balanced across any of the servers, leader or not. While replicas cannot be modified, they can still be read, which helps take load off the database leader.

Run HAProxy with this configuration, and clients can use the load balancer as a single entry point to the database cluster.
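
Clients then point psql (or any connection string) at the load balancer instead of at individual nodes. For example, with <etcd-lb-address> as a placeholder for the address of the etcd-lb VM:

psql -h <etcd-lb-address> -p 5432 -U patroni_super -d postgres   # read-write, routed to the leader
psql -h <etcd-lb-address> -p 5433 -U patroni_super -d postgres   # read-only, balanced across all nodes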

Creating a highly available application is by no means a simple task. There are many factors to consider to ensure that our solution does indeed make the application more resilient. This is especially true for databases, which hold persistent, sensitive data and require extra redundancy to be safe from disaster. Patroni eases this problem by automating failover and implementing steps that go a long way towards guaranteeing the consistency of our replicas. Such measures help our cause immensely in creating an application with high availability.
