Increasing the Availability of an Application Part 1: Introduction to Container Orchestration

Inigo Ramli
11 min read · Jun 7, 2021

Development of TBCare has not been a smooth experience. This is most apparent when performing quality assurance tests on our app. For months, our application has been hosted on a single computing instance on our faculty’s server. Both the development and staging versions of TBCare run on this single server, and it is proving to be more trouble than it’s worth.

Running an application on a single machine means that we are at the mercy of our provider’s availability. This is troubling for several reasons, chief of which is the fact that our faculty’s servers are not particularly reliable. Servers will occasionally have their capabilities reduced, increasing the response time of each HTTP request, if they do not go down entirely. Worst of all was when our host went down the night before a sprint review! Clearly, our application can benefit from more availability and disaster recovery… and so begins my journey to tackle this issue.

Several issues will be addressed in this two-part blog series. The central question is: how can we improve our current software architecture to be more reliable, so that the chance of failure during a disaster is minimized? Several points of improvement will be presented, each requiring its own solution. Part 1 of this article delves into improving the availability of our application as a whole, while part 2 will focus on improving the availability of our application database.

Current State

TBCare is a web application with a standard client-server architecture. Users access our data management service through a web interface, which sends CRUD requests to our API. Currently, the front end and back end of our application are hosted on separate machines. The back end of TBCare is containerized and more complex than its front-end counterpart, so we will focus on the former.

Our back-end application has been containerized. Several key functionalities have been isolated from each other and implemented in separate containers: load balancing, API handling, database connection, and monitoring each have their own container.

Below is an overview of our current architecture. An edge connecting two containers implies that HTTP requests are exchanged between them.

Diagram 1: Current Architecture

Based on the diagram above, we can notice several problems right away:

  1. All containers are hosted inside one computing instance, making that computing instance a single point of failure without any disaster recovery plan.
  2. Each container has only one replica. As such, should any one of these containers go down, the whole application stops working. This means that each of the containers above is a potential single point of failure.

Our main task will be to secure these points of failure. The goal is to create enough redundancy that our uptime is not disturbed even if one of the components above goes down.

Solution Design

Tackling the first issue is easy (at least theoretically): just provision more servers! Ideally, these servers should be physically apart from each other, even running on separate electricity grids, to minimize the risk of both servers going down simultaneously. This can be done easily by using a cloud provider to create several VMs. For the sake of this article, three computing instances will be used.

Multiple servers will be next to useless if we do not utilize them. We can spread out our containers to each of these instances. Below is one possible distribution:

Diagram 2: Multi-host architecture without redundancies

This configuration is objectively better than the initial one. Should ‘Instance 2’ go down for one reason or another, only three containers will be brought down with it. Unfortunately, ‘Instance 2’ also contains the important ‘app’ container that runs all of our application logic. As such, this configuration has not really solved our problem.

Notice that some of these containers can be duplicated. At least four containers (‘app’, ‘locust’, ‘locust-exporter’, and ‘nginx’) do not store any persistent data. This means, for example, that we can run two instances of the ‘app’ container and use them interchangeably as long as they are connected to the same ‘db’ container. The ‘app’ container only performs application logic and handles HTTP requests, so we do not need to worry about data conflicts. This is definitely not the case with the ‘db’ container, which runs PostgreSQL and persists its data within the instance it is on. This can be circumvented using log shipping or streaming replication, which will be covered in part 2.

We will ignore the monitoring containers for now (locust, locust-exporter, prometheus, and grafana). These containers are built for testing purposes; in case of a disaster, losing them is less of a concern than losing actual application uptime. The result is presented in Diagram 3, with two additional ‘app’ containers.

Diagram 3: Our final solution for now

Creating redundancies like these serves several purposes:

  • It removes a single point of failure. Now, our app service will not go down if one instance dies.
  • With the addition of a load-balancing tool such as NGINX, we can distribute traffic across the three instances, preventing Instance 2 from becoming the bottleneck of the whole system. A rough sketch of such a configuration is shown below.
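As a minimal sketch (not our exact configuration), an NGINX reverse proxy can spread incoming requests over the replicated ‘app’ service. The service name app and the upstream port 8000 are assumptions for illustration; inside a Docker overlay network the service name resolves to a virtual IP that already balances across replicas.

# nginx.conf (sketch): forward incoming HTTP traffic to the replicated 'app' service
events {}

http {
    upstream backend {
        # 'app' is resolved by Docker's internal DNS; port 8000 is an assumed app port
        server app:8000;
    }

    server {
        listen 80;

        location / {
            # Requests are forwarded to the upstream; replicas share the load
            proxy_pass http://backend;
            proxy_set_header Host $host;
        }
    }
}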

This should be good enough for now. Our service’s main functionality should remain highly available with these sets of redundancies. Let’s look at how we can implement this new architecture in practice.

Container Orchestration

Managing multiple containers that communicate across multiple hosts is anything but easy. Managing in this context means:

  1. Ensuring that all instances are up and able to communicate with each other.
  2. Monitoring the health of each container in all instances.
  3. Replacing dead services when an instance goes down, while making sure that the new configuration of the remaining instances still follows the high-availability principle of “no single point of failure”.

These tasks are collectively called container orchestration, and they can be automated using one of several available tools. More advanced tools like Kubernetes even offer autoscaling, creating more replicas and instances when traffic is high and removing them when traffic is low. Because we do not depend on a managed cloud orchestration service at this time and our application is not that complex, we will settle for a simpler (and thus easier to use) tool called Docker Swarm.

Docker Swarm can be seen as an extension of Docker Compose. It allows multiple instances, called nodes, to host a set of containers in one stack. Once active, Swarm will create and monitor containers according to the configuration we have given it, working to achieve and then maintain the desired state.

For this section, we will try to host our back-end application using Docker Swarm. Swarm’s ability to orchestrate multiple nodes will be demonstrated by running our stack on multiple Compute Engine instances in Google Cloud.

Creating Multiple Instances

The first thing we need is several potential nodes for our Swarm. For this demonstration, I will create three compute instances located in the same zone (asia-southeast1-b) with Debian 10 as their OS. Each instance also has Docker installed. Network-wise, each instance allows incoming HTTP and HTTPS requests and has ‘swarm’ as its network tag.
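If you prefer the command line over the Cloud Console, instances along these lines can be provisioned with gcloud (the instance names and machine type here are only an example; Docker still has to be installed on each instance afterwards):

gcloud compute instances create swarm-1 swarm-2 swarm-3 \
    --zone=asia-southeast1-b \
    --machine-type=e2-small \
    --image-family=debian-10 \
    --image-project=debian-cloud \
    --tags=swarm,http-server,https-server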

Figure 1: List of instances to be used. Note the internal IPs, as these will be important later

Configure The Firewall Rules

Swarm nodes use several ports to communicate with each other. These ports are:

  • TCP port 2376 for secure Docker client communication. This port is required for Docker Machine to work. Docker Machine is used to orchestrate Docker hosts.
  • TCP port 2377. This port is used for communication between the nodes of a Docker Swarm or cluster. It only needs to be opened on manager nodes.
  • TCP and UDP port 7946 for communication among nodes (container network discovery).
  • UDP port 4789 for overlay network traffic (container ingress networking).

Allow our ‘swarm’ instances to communicate with each other over these ports by setting a firewall rule.

Figure 2: Allow ports to be accessed internally
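For reference, an equivalent rule can be created from the command line; the rule name below is only an example:

gcloud compute firewall-rules create allow-swarm-internal \
    --network=default \
    --direction=INGRESS \
    --allow=tcp:2376,tcp:2377,tcp:7946,udp:7946,udp:4789 \
    --source-tags=swarm \
    --target-tags=swarm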

We can confirm whether these ports are accessible internally by running netcat -z -v -w5 <destination-ip> <target-port>. Note that the cloud firewall configuration may take several minutes to apply.

Start Docker Swarm

There are two types of nodes in Docker Swarm: manager nodes and worker nodes. Both types of nodes are responsible for hosting the containers given to them. Manager nodes are also responsible for managing the state of their swarm: they hand out configuration to worker nodes and to each other. If there are no manager nodes left, the swarm is halted.

Docker allows a swarm to have multiple manager nodes to prevent the managers themselves from becoming a point of failure. Multiple manager nodes work with each other as follows: in order to maintain control over the whole swarm, a quorum must be reached. This means that out of N manager nodes, a strict majority (at least floor(N/2) + 1) must be up. Creating three manager nodes means that our system will remain available as long as at most one manager is down at any moment in time. In this scenario, we have exactly three instances, so it is best to promote all of them to manager nodes to avoid a bottleneck.

SSH into one of your instances and start Docker Swarm by running docker swarm init.

swarm-1:~$ sudo docker swarm init
Swarm initialized: current node (ms75d68mvm3hon3v1f4sxp8q3) is now a manager.
To add a worker to this swarm, run the following command:

    docker swarm join --token SWMTKN-1-0rngjglb2svjvibe1zsc2o213eys3qxoq5rcxggnij6osvtn2y-0fnnxtz8ipkocq4xb7bey3gz6 10.148.0.14:2377

To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.
swarm-1:~$

Swarm will give you a command to run on the nodes you wish to incorporate into the swarm. The token above is for adding worker nodes. To get a token for manager nodes, run docker swarm join-token manager.

swarm-1:~$ sudo docker swarm join-token manager
To add a manager to this swarm, run the following command:
    docker swarm join --token SWMTKN-1-0rngjglb2svjvibe1zsc2o213eys3qxoq5rcxggnij6osvtn2y-8xre8whcky6u4z6ii9pscwq64 10.148.0.14:2377

swarm-1:~$

Go to the other instances and run the command to join the swarm. Below is an example of the confirmation message after running the command above.

swarm-3:~$ sudo docker swarm join --token SWMTKN-1-0rngjglb2svjvibe1zsc2o213eys3qxoq5rcxggnij6osvtn2y-8xre8whcky6u4z6ii9pscwq64 10.148.0.14:2377
This node joined a swarm as a manager.

Confirm that all nodes are active and reachable by running docker node ls

swarm-1:~$ sudo docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION
ms75d68mvm3hon3v1f4sxp8q3 * swarm-1 Ready Active Leader 20.10.7
q80oz2uz4j10ji1g0xy40osuj swarm-2 Ready Active Reachable 20.10.7
8rbivigoc70diru2fw2al6zsy swarm-3 Ready Active Reachable 20.10.7

Obtain Configuration File

There are several ways to provide stack configuration to Docker Swarm. One of them is by reusing a docker-compose.yml file. TBCare already has a compose file that we can use. However, we need to edit several things first:

  • The compose version should be at least 3 (e.g. 3.9).
  • The network type should be overlay instead of the default bridge type.
  • Add a new attribute deploy: replicas: 3 to the app container configuration. This prompts Swarm to create three replicas of the app container. A sketch of the edited file is shown after this list.
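Below is a minimal sketch of what the relevant parts of such a compose file could look like. The image tags mirror the service listing shown later in this article, while the remaining details (network name, volume, published port) are illustrative assumptions rather than the exact TBCare configuration.

version: "3.9"

services:
  app:
    image: tbcare/backend:staging
    networks:
      - backend
    deploy:
      replicas: 3            # Swarm keeps three copies of this service running

  db:
    image: postgres:13.2-alpine
    networks:
      - backend
    volumes:
      - db-data:/var/lib/postgresql/data   # data stays on the node hosting this container

  nginx:
    image: nginx:1.19-alpine
    ports:
      - "80:80"              # published on the routing mesh, reachable from any node
    networks:
      - backend

networks:
  backend:
    driver: overlay          # overlay networks span every node in the swarm

volumes:
  db-data: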

Once everything is set, run docker stack deploy --compose-file=<compose_file_path> <stack_name>

swarm-1:~/neza-backend$ sudo docker stack deploy --compose-file=./docker-compose.yml tbcare
Creating network tbcare_monitoring
Creating network tbcare_backend
Creating service tbcare_nginx
Creating service tbcare_app
Creating service tbcare_db
Creating service tbcare_locust
Creating service tbcare_locust-exporter
Creating service tbcare_prometheus
Creating service tbcare_grafana
swarm-1:~/neza-backend$

You can see the status of containers using docker service ls

swarm-1:~/neza-backend$ sudo docker service ls
ID NAME MODE REPLICAS IMAGE PORTS
l2b3llx6awj8 tbcare_app replicated 3/3 tbcare/backend:staging
5ql73hh68xoq tbcare_db replicated 1/1 postgres:13.2-alpine
9llap7a8b6sq tbcare_grafana replicated 1/1 grafana/grafana:latest
wv4ax3i74wuw tbcare_locust replicated 1/1 tbcare/locust:development
ae9q4bpjw24s tbcare_locust-exporter replicated 1/1 containersol/locust_exporter:latest
mvs03nc2imrx tbcare_nginx replicated 1/1 nginx:1.19-alpine *:80->80/tcp
pkhrluqgzh1w tbcare_prometheus replicated 1/1 prom/prometheus:latest
swarm-1:~/neza-backend$

It may take several minutes for all containers to be up and running. If a container is still not up after that (indicated by unstable values in the REPLICAS column), we can check the container (service) logs with docker service logs <service_name>.

Our Swarm should be up and running now! Let’s try accessing the load testing tool by going to http://<swarm-1-IP>/locust. Below is the load testing page accessed from swarm-1’s IP. However, you can also access this page using the IPs of the other instances in the swarm: the routing mesh provided by Docker means that every node can receive incoming requests on published ports, making the swarm accessible from any node.

Figure 3: Load testing accessed from swarm-1
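As a quick sanity check of the routing mesh (assuming the other instances’ external IPs are reachable from your machine), the same page should respond from every node:

# Published ports work on every node in the swarm, not just the one hosting nginx
curl -I http://<swarm-2-IP>/locust
curl -I http://<swarm-3-IP>/locust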

In total, there should be nine containers running across the whole swarm. Now, run docker ps on any instance.

swarm-1:~/neza-backend$ sudo docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
4369028ab9ce nginx:1.19-alpine "/docker-entrypoint.…" 14 minutes ago Up 14 minutes 80/tcp tbcare_nginx.1.fjwsx0fqqkknkyle1gax6gp54
44e83e1881c0 tbcare/locust:development "locust -f /mnt/locu…" 14 minutes ago Up 14 minutes 5557/tcp, 8089/tcp tbcare_locust.1.569ft1g7uj8un8bf80wbtzcmr
97c302452df6 grafana/grafana:latest "/run.sh" 17 minutes ago Up 17 minutes 3000/tcp tbcare_grafana.1.l1bkr6qs69manobj1lfg49c13
f2637c13f9da prom/prometheus:latest "/bin/prometheus --c…" 19 minutes ago Up 18 minutes 9090/tcp tbcare_prometheus.1.h7gzwxntrara19co55fcdtqwf
2473cb5ed214 containersol/locust_exporter:latest "locust_exporter" 19 minutes ago Up 19 minutes 9646/tcp tbcare_locust-exporter.1.r5z5augifqbuj2tcran87wfji

From the example output above, there are only five containers running inside the instance swarm-1. Where are the rest? They are running on the other nodes: some on swarm-2 and others on swarm-3.
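To check where each replica has actually been scheduled, docker service ps lists every task of a service along with the node it runs on:

# Show where each replica of the 'app' service is placed (see the NODE column)
sudo docker service ps tbcare_app

# The same works for any other service in the stack, e.g. the database
sudo docker service ps tbcare_db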

Simulate a Disaster

It is time to see the power of container orchestration tools like Docker Swarm. Let’s turn one of the instances off, preferably one hosting a lot of containers. Since there are three manager nodes, losing one should, in theory, not crash the whole system. We can do this easily by clicking Stop on one of the instances.
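The same can be done from the command line, for example:

# Stop the instance to simulate it going down
gcloud compute instances stop swarm-1 --zone=asia-southeast1-b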

Connect to one of the remaining instances and get the list of services status again.

swarm-2:~$ sudo docker service ls
ID NAME MODE REPLICAS IMAGE PORTS
l2b3llx6awj8 tbcare_app replicated 1/3 tbcare/backend:staging
5ql73hh68xoq tbcare_db replicated 0/1 postgres:13.2-alpine
9llap7a8b6sq tbcare_grafana replicated 1/1 grafana/grafana:latest
wv4ax3i74wuw tbcare_locust replicated 0/1 tbcare/locust:development
ae9q4bpjw24s tbcare_locust-exporter replicated 0/1 containersol/locust_exporter:latest
mvs03nc2imrx tbcare_nginx replicated 1/1 nginx:1.19-alpine *:80->80/tcp
pkhrluqgzh1w tbcare_prometheus replicated 0/1 prom/prometheus:latest

You can see that Swarm is attempting to rebuild the containers that were lost in the simulated disaster. This is done automatically, without any need for intervention. Our responsibility now is to restart the lost instance as soon as possible; Docker Swarm will keep the system up and running until then.

Before restarting swarm-1, let’s see if we can still access our app. Because locust was not replicated, it may take up to a minute for Swarm to reschedule it.

Figure 4: Locust after swarm-1 is down

Conclusion

High availability is an important aspect of software maintenance and quality. It can be achieved by creating replicas and redundancy across the system, which requires careful planning and a more complex server structure. We have shown that container orchestration tools can make the development and maintenance of such structures simpler to manage. However, there is still another major point of failure: our database container. We will see how to tackle it using log shipping and/or streaming replication in part 2.


Inigo Ramli

Computer Science student at Universitas Indonesia. An avid competitive programmer and ICPC participant.