Increasing the Availability of an Application Part 1: Introduction to Container Orchestration

Development of TBCare has not been a smooth experience. This is most apparent when performing quality assurance tests on our app. For months, our application has been hosted on a single computing instance in our faculty’s server. Both the development and staging versions of TBCare run on this single server, and it is proving to be more trouble than it’s worth.

Running an application on a single machine means that we are at the mercy of our provider’s availability. This is troubling for several reasons, chief of which is that our faculty’s servers are not particularly reliable. Servers will occasionally have their capabilities reduced (increasing the response time of every HTTP request), if not go down entirely. Worst of all was when our host went down the night before a sprint review! Clearly, our application can benefit from better availability and disaster recovery… and so begins my journey to tackle this issue.

This two-part blog series addresses one main problem: how can we improve our current software architecture to be more reliable, such that the chance of failure during a disaster is minimized? Several points of improvement will be presented, each requiring its own solution. Part 1 of this article delves into improving the availability of our application as a whole, while the second part will focus on improving the availability of our application database.

Current state

The TBCare web application follows a standard client-server architecture. Users access our data management service through a web interface, which sends CRUD requests to our API. Currently, the front end and back end of our application are hosted on separate machines. The back end of TBCare is containerized and more complex than its front-end counterpart, so we will focus on the former.

Our back end application has been containerized: several key functionalities are isolated from each other and implemented in separate containers. Load balancing, API handling, database connection, and monitoring each have their own container.

Below is an overview of our current architecture. An edge connecting two containers implies that HTTP requests will be exchanged between them.

Diagram 1: Current Architecture

Based on the diagram above, we can notice several problems right away:

  1. All containers are hosted inside one computing instance, making that computing instance a single point of failure without any plan for disaster.

  2. Every container runs as a single copy, so the failure of any one container (most critically ‘app’ or ‘db’) takes down the functionality it provides.

Our main task will be to secure these points of failure. The goal is to create enough redundancy that our uptime will not be disturbed even if one of the components above goes down.

Solution Design

Tackling the first issue is easy (at least in theory): just provision more servers! Ideally, these servers should be physically apart from each other, even running on separate electricity grids. This minimizes the risk of all servers going down simultaneously, and can be done easily by using a cloud provider to create several VMs. For the sake of this article, 3 computing instances will be used.

Multiple servers are next to useless if we do not utilize them. We can spread our containers across these instances. Below is one possible distribution:

Figure 2: Multi host architecture without redundancies

This configuration is objectively better than the initial one. Should ‘Instance 2’ go down for one reason or another, only 3 containers will be brought down with it. Unfortunately, ‘Instance 2’ also contains the all-important ‘app’ container that runs our application logic. As such, this configuration has not really solved our problem.

Notice that some of these containers can be duplicated. At least 4 containers (‘app’, ‘locust’, ‘locust-exporter’, and ‘nginx’) do not store any persistent data. This means, for example, that we can run two instances of the ‘app’ container and use them interchangeably as long as they are connected to the same ‘db’ container. The ‘app’ container only performs application logic and handles HTTP requests, so we do not need to worry about data conflicts. This is definitely not the case with the ‘db’ container, which runs PostgreSQL and persists its data within the instance it is on. This can be circumvented using log shipping or streaming replication, which will be covered in part 2.

We will ignore the monitoring containers for now (locust, locust-exporter, prometheus, and grafana). These containers exist for testing purposes; in case of disaster, losing them is less of a concern than losing actual application uptime. The result is presented in Diagram 3, with the addition of two more ‘app’ containers.

Diagram 3: Our final solution for now

Creating redundancies like these serves several purposes:

  • It removes one single point of failure. Now, our app service will not go down if one instance dies.

  • Incoming requests can be spread across the ‘app’ replicas, reducing the load on any single container.

This should be good enough for now. Our service’s main functionality should remain highly available with this set of redundancies. Let’s look at how we can implement this new architecture in practice.

Container Orchestration

Managing multiple containers communicating across multiple hosts is anything but easy. Managing in this context means:

  1. Ensuring that all instances are up and able to communicate with each other.

  2. Distributing containers across instances according to the desired configuration.

  3. Detecting failed containers and recreating them to restore the desired state.

These tasks are collectively called container orchestration, and they can be automated using one of several available tools. More advanced tools like Kubernetes even offer autoscaling, creating more replicas and instances when traffic is high and removing them when traffic is low. Because we do not need that kind of scale at this time and our application is not that complex, we will settle for a simpler (and thus easier to use) tool called Docker Swarm.

Docker Swarm can be seen as an extension of Docker Compose. It allows multiple instances, called nodes, to host a set of containers in one stack. Once active, Swarm creates and monitors containers according to the configuration we have given; it continuously tries to achieve and then maintain this desired state.

For this section, we will try to host our backend application using Docker Swarm. Swarm’s capability to orchestrate multiple nodes will be demonstrated by running our stack in Google Cloud on multiple Compute Engine instances.

Creating Multiple Instance

The first thing we need is several potential nodes for our Swarm. For this demonstration, I will create 3 compute instances located in the same zone (asia-southeast1-b) with Debian 10 as their OS. Each instance also has Docker installed. Network-wise, each instance allows incoming HTTP and HTTPS requests and has ‘swarm’ as its network tag.

Figure 1: List of instances to be used. Note the internal IP as this will be important later

Configure The Firewall Rules

Swarm nodes use several ports to communicate with each other. These ports are:

  • TCP port 2376 for secure Docker client communication. This port is required for Docker Machine to work. Docker Machine is used to orchestrate Docker hosts.

  • TCP port 2377 for cluster management communication between the nodes of a swarm.

  • TCP and UDP port 7946 for communication among nodes (node discovery).

  • UDP port 4789 for overlay network traffic between containers.

Allow our ‘swarm’ instances to communicate with each other over these ports by setting a firewall rule.

Figure 2: Allow ports to be accessed internally

We can confirm that these ports are accessible internally by running netcat -z -v -w5 <destination-ip> <target-port>. Note that the cloud firewall configuration may take several minutes to apply.
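To avoid typing the check for every port by hand, a short loop can print the commands to run against a peer node. This is just a sketch: 10.148.0.2 is a placeholder for another instance’s internal IP, and the port list covers the swarm-to-swarm ports above.

```shell
# Print a reachability check for each port the swarm needs.
# 2377: cluster management, 7946: node discovery, 4789: overlay network
PEER="10.148.0.2"  # placeholder: internal IP of another swarm instance
for port in 2377 7946 4789; do
  echo "netcat -z -v -w5 $PEER $port"
done
```

Run the printed commands from each instance; every port should report success once the firewall rule has propagated.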

Start Docker Swarm

There are two types of nodes in Docker Swarm: manager nodes and worker nodes. Both types are responsible for hosting the containers given to them. Manager nodes are additionally responsible for managing the state of their swarm: they distribute configuration to worker nodes and to each other. If there are no manager nodes left, the swarm is halted.

Docker allows a swarm to have multiple manager nodes to prevent the manager from becoming a point of failure. Multiple manager nodes work with each other as follows: in order to maintain control over the whole swarm, a quorum must be reached. This means that out of N manager nodes, a majority of at least floor(N/2) + 1 must be up. Creating 3 manager nodes means that our system will remain available as long as at most 1 manager is down at any moment in time. In this scenario, we have exactly 3 instances, so it is best to promote all of them to manager nodes to maximize fault tolerance.
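The relationship between manager count and fault tolerance can be sketched with a quick shell loop (pure arithmetic, no Docker required):

```shell
# For N managers, quorum = floor(N/2) + 1, and the swarm
# tolerates N - quorum manager failures before halting.
for n in 1 3 5 7; do
  quorum=$(( n / 2 + 1 ))
  tolerance=$(( n - quorum ))
  echo "$n managers: quorum=$quorum, tolerates $tolerance failure(s)"
done
```

Note that an even manager count buys nothing: 4 managers need a quorum of 3 and still tolerate only 1 failure, which is why odd counts like our 3 are the usual recommendation.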

SSH into one of your instances and start docker swarm by running docker swarm init.

swarm-1:~$ sudo docker swarm init
Swarm initialized: current node (ms75d68mvm3hon3v1f4sxp8q3) is now a manager.
To add a worker to this swarm, run the following command:
docker swarm join --token SWMTKN-1-0rngjglb2svjvibe1zsc2o213eys3qxoq5rcxggnij6osvtn2y-0fnnxtz8ipkocq4xb7bey3gz6 10.148.0.14:2377
To add a manager to this swarm, run 'docker swarm join-token manager' and follow the instructions.
swarm-1:~$

Swarm will give you a command to run on the nodes you wish to incorporate into the swarm. The token above is for joining worker nodes. To get a token for manager nodes, run docker swarm join-token manager.

swarm-1:~$ sudo docker swarm join-token manager
To add a manager to this swarm, run the following command:
docker swarm join --token SWMTKN-1-0rngjglb2svjvibe1zsc2o213eys3qxoq5rcxggnij6osvtn2y-8xre8whcky6u4z6ii9pscwq64 10.148.0.14:2377
swarm-1:~$

Go to the other instances and run the command to join the swarm. Below is an example of the confirmation message after running the command above.

swarm-3:~$ sudo docker swarm join --token SWMTKN-1-0rngjglb2svjvibe1zsc2o213eys3qxoq5rcxggnij6osvtn2y-8xre8whcky6u4z6ii9pscwq64 10.148.0.14:2377
This node joined a swarm as a manager.

Confirm that all nodes are active and reachable by running docker node ls.

swarm-1:~$ sudo docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION
ms75d68mvm3hon3v1f4sxp8q3 * swarm-1 Ready Active Leader 20.10.7
q80oz2uz4j10ji1g0xy40osuj swarm-2 Ready Active Reachable 20.10.7
8rbivigoc70diru2fw2al6zsy swarm-3 Ready Active Reachable 20.10.7

Obtain Configuration File

There are several ways to provide stack configurations to Docker Swarm. One of them is reusing a docker-compose.yml file. TBCare already has a compose file that we can use. However, we need to edit several things first:

  • The compose file version should be at least 3 (e.g. 3.9), since swarm-specific keys are only honored from version 3 onward.

  • Services we want to be redundant (such as ‘app’) need a deploy section specifying the number of replicas.

  • Every service must reference a pre-built image that all nodes can pull, because docker stack deploy ignores the build key.
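As a sketch of those edits (service and image names follow the service listing shown later; the real TBCare compose file likely contains more services and settings than shown here):

```yaml
version: "3.9"

services:
  app:
    image: tbcare/backend:staging   # pre-built image; 'build:' is ignored by stack deploy
    deploy:
      replicas: 3                   # three interchangeable 'app' containers
    networks:
      - backend
  db:
    image: postgres:13.2-alpine
    deploy:
      replicas: 1                   # stateful; replication is covered in part 2
    networks:
      - backend

networks:
  backend:
```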

Once everything is set, run docker stack deploy --compose-file=<compose_file_path> <stack_name>.

swarm-1:~/neza-backend$ sudo docker stack deploy --compose-file=./docker-compose.yml tbcare
Creating network tbcare_monitoring
Creating network tbcare_backend
Creating service tbcare_nginx
Creating service tbcare_app
Creating service tbcare_db
Creating service tbcare_locust
Creating service tbcare_locust-exporter
Creating service tbcare_prometheus
Creating service tbcare_grafana
swarm-1:~/neza-backend$

You can see the status of the containers using docker service ls.

swarm-1:~/neza-backend$ sudo docker service ls
ID NAME MODE REPLICAS IMAGE PORTS
l2b3llx6awj8 tbcare_app replicated 3/3 tbcare/backend:staging
5ql73hh68xoq tbcare_db replicated 1/1 postgres:13.2-alpine
9llap7a8b6sq tbcare_grafana replicated 1/1 grafana/grafana:latest
wv4ax3i74wuw tbcare_locust replicated 1/1 tbcare/locust:development
ae9q4bpjw24s tbcare_locust-exporter replicated 1/1 containersol/locust_exporter:latest
mvs03nc2imrx tbcare_nginx replicated 1/1 nginx:1.19-alpine *:80->80/tcp
pkhrluqgzh1w tbcare_prometheus replicated 1/1 prom/prometheus:latest
swarm-1:~/neza-backend$

It may take several minutes for all containers to be up and running. If a container is still not up after that (indicated by unstable values in the REPLICAS column), we can check the container (service) log with docker service logs <service_name>.

Our Swarm should be up and running now! Let’s try accessing the load testing tool by going to http://<swarm-1-IP>/locust. Below is the load testing page accessed via swarm-1’s IP. However, you can also access this page using the IPs of the other instances in the swarm. The routing mesh provided by Docker means that every node can receive incoming requests on published ports, making the swarm accessible from any node.

Figure 3: Load testing accessed from swarm-1

In total, there should be 9 containers running in the whole swarm. Now, run docker ps on any instance.

swarm-1:~/neza-backend$ sudo docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
4369028ab9ce nginx:1.19-alpine "/docker-entrypoint.…" 14 minutes ago Up 14 minutes 80/tcp tbcare_nginx.1.fjwsx0fqqkknkyle1gax6gp54
44e83e1881c0 tbcare/locust:development "locust -f /mnt/locu…" 14 minutes ago Up 14 minutes 5557/tcp, 8089/tcp tbcare_locust.1.569ft1g7uj8un8bf80wbtzcmr
97c302452df6 grafana/grafana:latest "/run.sh" 17 minutes ago Up 17 minutes 3000/tcp tbcare_grafana.1.l1bkr6qs69manobj1lfg49c13
f2637c13f9da prom/prometheus:latest "/bin/prometheus --c…" 19 minutes ago Up 18 minutes 9090/tcp tbcare_prometheus.1.h7gzwxntrara19co55fcdtqwf
2473cb5ed214 containersol/locust_exporter:latest "locust_exporter" 19 minutes ago Up 19 minutes 9646/tcp tbcare_locust-exporter.1.r5z5augifqbuj2tcran87wfji

From the example output above, only 5 containers are running inside the instance swarm-1. Where are the rest? They are running outside of swarm-1: some of them in swarm-2, and the others in swarm-3.

Simulate a Disaster

It is time to see the power of container orchestration tools like Docker Swarm. Let’s turn one of the instances off, preferably one with a lot of containers on it. Since there are 3 manager nodes, losing one should, in theory, not crash the whole system. We can do this easily by clicking stop on one of the instances.

Connect to one of the remaining instances and get the list of service statuses again.

swarm-2:~$ sudo docker service ls
ID NAME MODE REPLICAS IMAGE PORTS
l2b3llx6awj8 tbcare_app replicated 1/3 tbcare/backend:staging
5ql73hh68xoq tbcare_db replicated 0/1 postgres:13.2-alpine
9llap7a8b6sq tbcare_grafana replicated 1/1 grafana/grafana:latest
wv4ax3i74wuw tbcare_locust replicated 0/1 tbcare/locust:development
ae9q4bpjw24s tbcare_locust-exporter replicated 0/1 containersol/locust_exporter:latest
mvs03nc2imrx tbcare_nginx replicated 1/1 nginx:1.19-alpine *:80->80/tcp
pkhrluqgzh1w tbcare_prometheus replicated 0/1 prom/prometheus:latest

You can see that Swarm is attempting to rebuild the containers that were lost in the simulated disaster. This is done automatically, without any need for intervention. Our responsibility now is to restart the lost instance as soon as possible; Docker Swarm will keep the system up and running until then.

Before restarting swarm-1, let’s check whether we can still access our app. Because locust was not replicated, it may take up to a minute before its container is rescheduled.

Figure 4: Locust after swarm-1 is down

Conclusion

High availability is an important aspect of software maintenance and quality. It can be achieved by creating replication and redundancy across the system, which requires careful planning and a fairly complex server structure. We have shown that container orchestration tools can make the development and maintenance of such structures simpler to manage. However, there is still another major point of failure: our database container. We will see how to tackle this using log shipping and/or streaming replication in part 2.

Computer Science student at Universitas Indonesia. An avid competitive programmer and participated in ICPC