
How to automate the monitoring of our 200 customer stacks in one deployment?

In the beginning…

At Toucan Toco, each customer has its own dedicated instance, and therefore its own stack: a dedicated MongoDB, a dedicated Redis, dedicated gunicorn/celery process pools. With each new project or pilot, a new instance is deployed.

Typical architecture

Very quickly, to make deploying a new stack easy, we built a suite of Ansible playbooks. They are in charge of provisioning the target machine with the stack and all its dependencies, and of adjusting the environment (fail2ban rules, etc.), but nothing was in place for our monitoring.

Rather than monitoring the different components of each instance with an agent-based tool like Nagios (which would have to be deployed, configured and updated on every server), the first version of our monitoring relied on the free plan offered by StatusCake. This service pings a given URL at regular intervals and sends alerts via Slack or email.
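
Conceptually, each of these checks boils down to a small polling loop. Here is a minimal sketch of the idea (the alerting function, interval and URL are illustrative placeholders, not StatusCake's actual implementation):

import time
import requests

def send_alert(message):
    # Placeholder for a real notification (Slack webhook, email, ...)
    print(message)

def monitor(url, check_rate=60):
    # Poll the URL at a fixed interval; alert when it stops answering 200
    while True:
        try:
            healthy = requests.get(url, timeout=10).status_code == 200
        except requests.RequestException:
            healthy = False
        if not healthy:
            send_alert(f"{url} did not answer with a 200")
        time.sleep(check_rate)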

The problem: doing everything manually

Since this part was not automated, StatusCake tests were created by hand through the web interface, as described in our documentation for initializing a new stack. Creating checks on the [StatusCake](https://www.statuscake.com/) interface is relatively simple, but it gets tedious as projects multiply.

Each time, you have to log in to the interface, find the credentials, make sure you create the test with the same options as the existing ones, then click, click, click and click again… As a result, no one wants to do it, and we are exposed to mistakes.

Charlie Chaplin - Modern Times

The pain is the same when the tests have to be updated (for example, to add conditions or parameters). Scaling would quickly become untenable.

The solution with Ansible!

With the multiplication of customers and projects, the need to automate the creation and updating of monitoring quickly became apparent. Since the deployment and update of our stacks are already automated by our Ansible scripts… why not delegate the creation of these checks to them as well?

In our case, we want two types of checks on the health of our services:

- a check that the service is up and responds to requests
- a check that the service can actually do its work, i.e. that its dependencies (MongoDB, Redis) are reachable

These two points match the distinction between liveness and readiness, well described in an Octo article: [Liveness and readiness probes: put intelligence into your clusters](https://blog.octo.com/liveness-et-readiness-probes-mettez-de-lintelligence-dans-vos-clusters/).

Implementing these checks requires exposing two dedicated routes on our service:

import pymongo
import redis
from flask import Flask, g

app = Flask(__name__)

@app.route('/liveness')
def liveness():
    # The process is up and able to answer HTTP requests
    return "OK", 200

@app.route('/readiness')
def readiness():
    # Ready only if the dependencies (Redis, MongoDB) answer
    try:
        g.redis_connection.ping()
        g.mongo_connection.server_info()
    except (pymongo.errors.ConnectionFailure, redis.ConnectionError):
        return "KO", 500
    return "OK", 200

On the Ansible side, we created a custom Python module, ansible-statuscake (forked from p404/ansible-statuscake, which is no longer maintained). The module is used like this:

- name: Create StatusCake test
  local_action:
    module:        status_cake_test
    username:      "my-user"
    api_key:       "my-api-key"
    name:          "My service check"
    url:           "https://myservice.example.com"
    state:         "present"
    test_type:     "HTTP"
    check_rate:    60  # one check every minute

In order for Ansible to find this custom module, you must place the Python script status_cake_test.py in the library folder at the root of the playbook.
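
For reference, a custom module is just a Python script that reads its arguments and prints a JSON result, and AnsibleModule handles that plumbing. A heavily simplified sketch of what the skeleton of such a module might look like (the actual StatusCake API calls are left out):

from ansible.module_utils.basic import AnsibleModule

def main():
    # Declare the parameters accepted in the playbook task
    module = AnsibleModule(
        argument_spec=dict(
            username=dict(required=True),
            api_key=dict(required=True, no_log=True),
            name=dict(required=True),
            url=dict(required=True),
            state=dict(default='present', choices=['present', 'absent']),
            test_type=dict(default='HTTP'),
            check_rate=dict(default=300, type='int'),
        )
    )
    # The real module calls the StatusCake API here to create, update
    # or delete the test, then reports whether anything changed
    module.exit_json(changed=True, name=module.params['name'])

if __name__ == '__main__':
    main()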

Technical note: since we use a `local_action`, the execution takes place on the machine that launches the deployment, not on the target machine. In such a context, we can benefit from the Python interpreter and pip packages of our choice, without depending on the target machine. This was useful for creating another custom module in Python 3 / asyncio, which allowed us to rewrite some tasks with a concurrency model and speed up our deployments.
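
To give an idea of that concurrency model, here is a minimal sketch (not our actual module; the URLs are placeholders) of running many HTTP checks concurrently with asyncio and aiohttp instead of one after the other:

import asyncio
import aiohttp

async def check(session, url):
    # One HTTP check; report failures instead of raising
    try:
        async with session.get(url) as response:
            return url, response.status
    except (aiohttp.ClientError, asyncio.TimeoutError):
        return url, None

async def check_all(urls):
    # All checks run concurrently instead of sequentially
    timeout = aiohttp.ClientTimeout(total=10)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        return await asyncio.gather(*(check(session, u) for u in urls))

print(asyncio.run(check_all([
    "https://customer1.example.com/liveness",
    "https://customer2.example.com/liveness",
])))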

Some stats to finish…

Today, thanks to this approach: