Increasing the Availability of an Application Part 1: Introduction to Container Orchestration
Development of TBCare has not been a smooth experience. This is most apparent when performing quality assurance tests on our app. For months, our application has been hosted on a single computing instance on our faculty's server. Both the development and staging versions of TBCare run on this single server, and it is proving to be more trouble than it's worth.
Running an application on a single machine means that we are at the mercy of our provider's availability. This is troubling for several reasons, chief among them the fact that our faculty's servers are not particularly reliable. They occasionally run at reduced capacity, increasing the response time of every HTTP request, when they do not go down outright. Worst of all was when our host went down the night before a sprint review! Clearly, our application could benefit from more availability and disaster recovery… and so begins my journey to tackle this issue.
Several issues will be addressed in this two-part blog series. The central question is: how can we improve our current software architecture to be more reliable, so that the chance of failure during a disaster is minimized? Several points of improvement will be presented, each requiring its own solution. Part 1 of this article delves into improving the availability of our application as a whole, while part 2 will focus on improving the availability of our application database.
TBCare is a standard client-server web application. Users access our data management service through a web interface, which sends CRUD requests to our API. Currently, the front end and back end of our application are hosted on separate machines. The back end of TBCare is more heavily containerized and more complex than its front-end counterpart, so we will focus on the former.
Our back-end application has been containerized. Several key functionalities have been isolated from each other and implemented in separate containers: load balancing, API handling, database connection, and monitoring each have their own container.
Below is an overview of our current architecture. An edge connecting two containers indicates that HTTP requests are exchanged between them.
Based on the diagram above, we can notice several problems right away:
- All containers are hosted inside one computing instance, making that computing instance a single point of failure with no plan for disaster recovery.
- Each container has only one replica. As such, should any one of the containers go down, the whole application stops working. This means that each of the containers above is a potential single point of failure.
Our main task will be to secure these points of failure. The goal is to create enough redundancy that our uptime is not disturbed even if one of the components above goes down.
Tackling the first issue is easy (at least in theory): just provision more servers! Ideally, these servers should be physically apart from each other, even running on separate electricity grids. This minimizes the risk of both servers going down simultaneously, and can be done easily by using a cloud provider to create several VMs. For the sake of this article, three computing instances will be used.
Multiple servers are next to useless if we do not utilize them. We can spread our containers across these instances. Below is one possible distribution:
This configuration is objectively better than the initial one. Should ‘Instance 2’ go down for one reason or another, only three containers will be brought down with it. Unfortunately, ‘Instance 2’ also contains the important ‘app’ container that runs all of our application logic. As such, this configuration has not really solved our problem.
Notice that some of these containers can be duplicated. At least four containers: ‘app’, ‘locust’, ‘locust-exporter’, and ‘nginx’ do not store any persistent data. This means, for example, that we can run two instances of the ‘app’ container and use them interchangeably as long as they are connected to the same ‘db’ container. The ‘app’ container only performs application logic and handles HTTP requests, so we do not need to worry about data conflicts. This is definitely not the case with the ‘db’ container, which runs PostgreSQL and persists its data within the instance it is on. This can be circumvented using log shipping or streaming replication, which will be covered in part 2.
We will ignore the monitoring containers for now (locust, locust-exporter, prometheus, and grafana). These containers are built for testing purposes, and in a disaster, losing them is less of a concern than losing actual application uptime. The result is presented in Diagram 3, with the addition of two ‘app’ containers.
Creating redundancies like these serves several purposes:
- It removes a single point of failure. Now, our app service will not go down if one instance dies.
- With the addition of a load-balancing tool such as NGINX, we can distribute traffic across the three instances, relieving Instance 2 of being the bottleneck of the whole system.
This should be good enough for now. Our service’s main functionality should remain highly available with these sets of redundancies. Let’s look at how we can implement this new architecture in practice.
Managing multiple containers communicating across multiple hosts is anything but easy. Managing in this context means:
- Ensuring that all instances are up and able to communicate with each other.
- Monitoring the health of each container in all instances.
- Replacing dead services when an instance goes down, while making sure that the new configuration across the remaining instances still follows the high-availability principle of “no single point of failure”.
These tasks are collectively called container orchestration, and they can be automated using one of several available tools. More advanced tools like Kubernetes even offer autoscaling, creating more replicas and instances when traffic is high and removing them when traffic is low. Because we do not need that level of sophistication at this time and our application is not that complex, we will settle on a simpler (and thus easier-to-use) tool called Docker Swarm.
Docker Swarm can be seen as an extension of Docker Compose. It allows multiple instances, called nodes, to host a set of containers as one stack. Once active, Swarm creates and monitors containers according to the configuration we have given it, working to achieve and then maintain the desired state.
For this section, we will host our back-end application using Docker Swarm. Swarm's ability to orchestrate multiple nodes will be demonstrated by running our stack on multiple compute instances in Google Cloud.
Creating Multiple Instances
The first thing we need is several potential nodes for our Swarm. For this demonstration, I will create three compute instances located in the same zone (asia-southeast1-b) with Debian 10 as their OS. Each instance also has Docker installed. Network-wise, each instance allows incoming HTTP and HTTPS requests and has ‘swarm’ as its network tag.
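As a sketch, this provisioning step could also be scripted with the gcloud CLI (the instance names and tag list mirror this article's setup; adjust the project, machine type, and names to your environment):

```shell
# Create three Debian 10 instances in one zone, tagged 'swarm' for the
# firewall rules below, plus the default HTTP/HTTPS traffic tags.
gcloud compute instances create swarm-1 swarm-2 swarm-3 \
  --zone=asia-southeast1-b \
  --image-family=debian-10 \
  --image-project=debian-cloud \
  --tags=swarm,http-server,https-server
```

Docker still has to be installed on each instance afterwards, either manually over SSH or via a startup script.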
Configure The Firewall Rules
Swarm nodes use several ports to communicate with each other. These ports are:
- TCP port 2376 for secure Docker client communication. This port is required for Docker Machine to work. Docker Machine is used to orchestrate Docker hosts.
- TCP port 2377 for communication between the nodes of a Docker Swarm or cluster. It only needs to be opened on manager nodes.
- TCP and UDP port 7946 for communication among nodes (container network discovery).
- UDP port 4789 for overlay network traffic (container ingress networking).
Allow our ‘swarm’ instances to communicate with each other on these ports by setting a firewall rule.
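Such a rule could be created with the gcloud CLI as follows (the rule name is illustrative; the ‘swarm’ tag matches the instances created earlier):

```shell
# Open the Swarm ports for traffic between instances tagged 'swarm'.
gcloud compute firewall-rules create swarm-internal \
  --allow=tcp:2376,tcp:2377,tcp:7946,udp:7946,udp:4789 \
  --source-tags=swarm \
  --target-tags=swarm
```

Using source and target tags keeps these ports closed to the public internet; only traffic between tagged instances is allowed.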
We can confirm whether these ports are accessible internally by running netcat -z -v -w5 <destination-ip> <target-port>. Note that the cloud firewall configuration may take several minutes to apply.
Start Docker Swarm
There are two types of nodes in Docker Swarm: manager nodes and worker nodes. Both types are responsible for hosting the containers given to them. Manager nodes are additionally responsible for managing the state of their swarm: they distribute configurations to worker nodes and to each other. If there are no manager nodes left, the swarm is halted.
Docker allows a swarm to have multiple manager nodes to prevent the manager from becoming a point of failure. Multiple manager nodes work with each other as follows: in order to maintain control throughout the whole swarm, a quorum (a majority) must be reached. This means that out of N manager nodes, at least floor(N/2) + 1 must be up. Creating three manager nodes means that our system remains available as long as at most one manager is down at any moment. In this scenario, we have exactly three instances, so it is best to promote all of them to manager nodes.
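To make the quorum arithmetic concrete, here is a small shell sketch (the function names are mine) that computes the quorum size and the number of tolerated manager failures for a few cluster sizes:

```shell
#!/bin/sh
# A majority of managers must stay up: floor(N/2) + 1.
# Equivalently, the swarm tolerates floor((N-1)/2) manager failures.
quorum()    { echo $(( $1 / 2 + 1 )); }
tolerated() { echo $(( ($1 - 1) / 2 )); }

for n in 1 3 5 7; do
  echo "$n manager(s): quorum $(quorum "$n"), tolerates $(tolerated "$n") failure(s)"
done
```

Note that even counts add no fault tolerance: four managers tolerate one failure, just like three, which is why odd numbers of managers are preferred.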
SSH into one of your instances and start Docker Swarm by running docker swarm init.
swarm-1:~$ sudo docker swarm init
Swarm initialized: current node (ms75d68mvm3hon3v1f4sxp8q3) is now a manager.
To add a worker to this swarm, run the following command:
docker swarm join --token SWMTKN-1-0rngjglb2svjvibe1zsc2o213eys3qxoq5rcxggnij6osvtn2y-0fnnxtz8ipkocq4xb7bey3gz6 10.148.0.14:2377
To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.
Swarm will give you a command to run on the nodes you wish to incorporate into the swarm. The token above is for creating worker nodes. To get a token for manager nodes, run docker swarm join-token manager.
swarm-1:~$ sudo docker swarm join-token manager
To add a manager to this swarm, run the following command:
docker swarm join --token SWMTKN-1-0rngjglb2svjvibe1zsc2o213eys3qxoq5rcxggnij6osvtn2y-8xre8whcky6u4z6ii9pscwq64 10.148.0.14:2377
Go to the other instances and run the command to join the swarm. Below is an example of the confirmation message after running the command above.
swarm-3:~$ sudo docker swarm join --token SWMTKN-1-0rngjglb2svjvibe1zsc2o213eys3qxoq5rcxggnij6osvtn2y-8xre8whcky6u4z6ii9pscwq64 10.148.0.14:2377
This node joined a swarm as a manager.
Confirm that all nodes are active and reachable by running docker node ls.
swarm-1:~$ sudo docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION
ms75d68mvm3hon3v1f4sxp8q3 * swarm-1 Ready Active Leader 20.10.7
q80oz2uz4j10ji1g0xy40osuj swarm-2 Ready Active Reachable 20.10.7
8rbivigoc70diru2fw2al6zsy swarm-3 Ready Active Reachable 20.10.7
Obtain Configuration File
There are several ways to give stack configurations to Docker Swarm. One of them is by reusing a docker-compose.yml file. TBCare already has a compose file that we can use. However, we need to edit several things first:
- The compose file version should be at least 3 (e.g. 3.9).
- The network type should be overlay instead of the default bridge.
- Add a new attribute deploy: replicas: 3 to the app container configuration. This prompts Swarm to create three replicas of the app container.
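Put together, the relevant parts of the edited compose file might look like the fragment below (the service and network names follow this article's stack; other services and settings are omitted):

```yaml
version: "3.9"

services:
  app:
    image: tbcare/backend:staging
    networks:
      - backend
    deploy:
      replicas: 3        # Swarm keeps three copies of 'app' running

networks:
  backend:
    driver: overlay      # spans all swarm nodes, unlike the default bridge
```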
Once everything is set, run docker stack deploy --compose-file=<compose_file_path> <stack_name>.
swarm-1:~/neza-backend$ sudo docker stack deploy --compose-file=./docker-compose.yml tbcare
Creating network tbcare_monitoring
Creating network tbcare_backend
Creating service tbcare_nginx
Creating service tbcare_app
Creating service tbcare_db
Creating service tbcare_locust
Creating service tbcare_locust-exporter
Creating service tbcare_prometheus
Creating service tbcare_grafana
You can see the status of the containers using docker service ls.
swarm-1:~/neza-backend$ sudo docker service ls
ID NAME MODE REPLICAS IMAGE PORTS
l2b3llx6awj8 tbcare_app replicated 3/3 tbcare/backend:staging
5ql73hh68xoq tbcare_db replicated 1/1 postgres:13.2-alpine
9llap7a8b6sq tbcare_grafana replicated 1/1 grafana/grafana:latest
wv4ax3i74wuw tbcare_locust replicated 1/1 tbcare/locust:development
ae9q4bpjw24s tbcare_locust-exporter replicated 1/1 containersol/locust_exporter:latest
mvs03nc2imrx tbcare_nginx replicated 1/1 nginx:1.19-alpine *:80->80/tcp
pkhrluqgzh1w tbcare_prometheus replicated 1/1 prom/prometheus:latest
It may take several minutes for all containers to be up and running. If a container is still not up after that (indicated by unstable values in the REPLICAS column), we can check the container (service) log with docker service logs <service_name>.
Our Swarm should be up and running now! Let's try accessing the load testing tool by going to http://<swarm-1-IP>/locust. Below is the load testing page accessed via swarm-1's IP. However, you can also access this page using the IPs of the other instances in the swarm. The routing mesh provided by Docker means that every node can receive incoming requests on published ports, making the swarm accessible from any node.
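One quick way to verify the routing mesh is to request the page from every node and check that each returns a successful status code. The IPs below are placeholders from the documentation range; substitute your instances' external IPs:

```shell
# Hypothetical external IPs of swarm-1..3; replace with your own.
NODES="203.0.113.11 203.0.113.12 203.0.113.13"

for ip in $NODES; do
  # -s silent, -o discard the body, -w print only the HTTP status code
  code=$(curl -s -o /dev/null -w '%{http_code}' "http://$ip/locust")
  echo "$ip -> HTTP $code"
done
```

Every node should answer, even the ones not currently hosting the ‘nginx’ or ‘locust’ container.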
In total, there should be nine containers running across the whole swarm. Now, run docker ps on any instance.
swarm-1:~/neza-backend$ sudo docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
4369028ab9ce nginx:1.19-alpine "/docker-entrypoint.…" 14 minutes ago Up 14 minutes 80/tcp tbcare_nginx.1.fjwsx0fqqkknkyle1gax6gp54
44e83e1881c0 tbcare/locust:development "locust -f /mnt/locu…" 14 minutes ago Up 14 minutes 5557/tcp, 8089/tcp tbcare_locust.1.569ft1g7uj8un8bf80wbtzcmr
97c302452df6 grafana/grafana:latest "/run.sh" 17 minutes ago Up 17 minutes 3000/tcp tbcare_grafana.1.l1bkr6qs69manobj1lfg49c13
f2637c13f9da prom/prometheus:latest "/bin/prometheus --c…" 19 minutes ago Up 18 minutes 9090/tcp tbcare_prometheus.1.h7gzwxntrara19co55fcdtqwf
2473cb5ed214 containersol/locust_exporter:latest "locust_exporter" 19 minutes ago Up 19 minutes 9646/tcp tbcare_locust-exporter.1.r5z5augifqbuj2tcran87wfji
From the example output above, there are only five containers running inside the instance swarm-1. Where are the rest? They are being run outside of swarm-1: some of them are in swarm-2, and others are in swarm-3.
Simulate a Disaster
It is time to see the power of container orchestration tools like Docker Swarm. Let's turn one of the instances off, preferably one with a lot of containers on it. Since there are three manager nodes, losing one should, in theory, not crash the whole system. We can do this easily by clicking Stop on one of the instances in the cloud console.
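The same "disaster" can also be triggered from the gcloud CLI; for example, assuming we sacrifice swarm-1:

```shell
# Simulate an instance failure by stopping one VM.
gcloud compute instances stop swarm-1 --zone=asia-southeast1-b
```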
Connect to one of the remaining instances and list the service statuses again.
swarm-2:~$ sudo docker service ls
ID NAME MODE REPLICAS IMAGE PORTS
l2b3llx6awj8 tbcare_app replicated 1/3 tbcare/backend:staging
5ql73hh68xoq tbcare_db replicated 0/1 postgres:13.2-alpine
9llap7a8b6sq tbcare_grafana replicated 1/1 grafana/grafana:latest
wv4ax3i74wuw tbcare_locust replicated 0/1 tbcare/locust:development
ae9q4bpjw24s tbcare_locust-exporter replicated 0/1 containersol/locust_exporter:latest
mvs03nc2imrx tbcare_nginx replicated 1/1 nginx:1.19-alpine *:80->80/tcp
pkhrluqgzh1w tbcare_prometheus replicated 0/1 prom/prometheus:latest
You can see that Swarm is attempting to rebuild the containers lost in the simulated disaster. This is done automatically, without any need for intervention. Our responsibility now is to restart the lost instance as soon as possible; Docker Swarm will keep the system up and running until then.
With swarm-1 down, let's see if we can still access our app through one of the remaining nodes. Because locust was not replicated, this may take up to a minute.
High availability is an important aspect of software maintenance and quality. It can be achieved by creating replication and redundancy across the system, which requires careful planning and a fairly complex server structure. We have shown that container orchestration tools can make the development and maintenance of such structures simpler to manage. However, there is still another major point of failure: our database container. We will see how to tackle it using log shipping and/or streaming replication in part 2.