
Having a disaster plan is cool but testing is better

The aim of this article is to show how and why we decided to test our DRP (Disaster Recovery Plan) under actual conditions… inside our production environment.

What is a DRP?

Toucan Toco’s solutions are historically SaaS solutions hosted on bare metal servers.

Of course, as a SaaS editor we follow the usual best practices for our infrastructure:

Dilbert by Scott Adams - DRP

We are confident all these things work. As time goes on they become more and more robust, because we use them every day. However, we wondered: would we be ready if our server provider’s data center exploded?

Even if the probability is pretty low, we need to be ready and know how to react.

Because we are a SaaS solution with customers relying on our product daily, and because we are committed to an SLA, it’s simply mandatory to know how to handle the worst.

The disaster recovery plan (also known as DRP) is your operational team’s capacity to manage major issues.

At our scale, having a fully replicated infrastructure over different data centers did not make sense.

We have a practical mindset: if a natural disaster were to happen, we wouldn’t need two functional, identical and up-to-date houses, because we would be able to rebuild a brand new one in less than 2 minutes.

And since we already had all the tools to restore our clients’ stacks and services (backups, snapshots and migrations), it was easy to connect the dots and imagine a DRP.

For a small team like us it was an interesting quick win.

This summer we wrote our complete disaster recovery plan. Our DRP is a set of docs, procedures, methods and scripts to recover Toucan Toco’s business after a disaster hits our data center.

At the same time, we also bought some spare servers in a different data center. They are up and production-ready, and they will be our fallback in case of a major issue.

Why should we test a DRP?

Isn’t it obvious? :D

You need to test it before a disaster happens.

Otherwise it’s just theory, and we don’t buy into the famous saying: “Don’t worry, it should work.”


Come on!

If you are reading this post, you’re probably an IT person and you know Murphy’s law very well: if something can go wrong… it will.

So you need to test your plan for several reasons:

That’s not all.

Inspired by Netflix’s Chaos Monkey tests and approach, we strongly believe this kind of test should be run for real, in a production context.

It’s like learning how to swim: you can learn the moves outside of the pool, but if you want to prove to yourself you can swim, you need to do it at the deep end of the pool.

How to test a DRP?

Once our DRP was ready, we planned a crash test in our production environment. Only a few people were in on the secret; the rest of the team only knew that something was planned, without any details.

Why? To reproduce the context, the “surprise” and the stress of the situation.

However, we targeted a part of the infrastructure with no direct business impact: it was the very first time, and we are not totally crazy :P.

We wanted to avoid the “building evacuation drill” pattern: those drills are never done the right way. The alarm rings, people grab everything (phones, bags, laptops…), they walk out casually, chatting with their colleagues, and finally leave the building… because they know it’s a test.

These drills are not run under actual conditions and are not relevant, so when a real emergency occurs it’s a huge mess. People are not always hurt or killed by the fire itself, but by panic and stress.

Back to our subject. The main goals of the tests are:

Simulating a real disaster turned out to be pretty simple.

We decided to simply drop all the network traffic from our partners, demos and trainings infrastructures.

Let’s see how we’ve done it.
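As a rough sketch, the “pull the plug” step can be a firewall script like the one below. This is not our actual script: the IP addresses are hypothetical, and by default it only prints the iptables commands it would run (set EXECUTE=1 and run it as root to really apply them).

```shell
#!/bin/sh
# drp-blackout.sh -- simulate losing a data center by dropping all traffic
# to and from the target hosts. IPs below are hypothetical placeholders.
set -eu

# Hypothetical addresses of the partners/demos/trainings servers.
TARGETS="203.0.113.10 203.0.113.11 203.0.113.12"

run() {
    if [ "${EXECUTE:-0}" = "1" ]; then
        "$@"                      # really apply the rule (requires root)
    else
        echo "DRY RUN: $*"        # default: only show what would be done
    fi
}

for ip in $TARGETS; do
    run iptables -I INPUT  -s "$ip" -j DROP
    run iptables -I OUTPUT -d "$ip" -j DROP
done
```

The dry-run default is deliberate: a script whose whole purpose is to break production should never do anything destructive unless explicitly asked.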

Release the kraken!

We’re here!

Everything is burning and I’m watching the fire!


As previously explained, the aim was also to validate that the tech team is able to manage the situation without me. So during the incident, I just took notes and checked the following points:

During the simulation, I asked questions to challenge the team’s decisions and choices, and to make them doubt themselves a little… ^^ (otherwise it’s no fun)
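To get an exact downtime figure for the debrief, a small watch script that polls a health endpoint and logs every up/down transition is enough. A minimal sketch, assuming a hypothetical health URL:

```shell
#!/bin/sh
# downtime-watch.sh -- poll the stack's health endpoint and log every
# up/down transition with a UTC timestamp, so the exact downtime can be
# read from the log afterwards. The URL below is a hypothetical example.
URL="${URL:-https://demo.example.com/health}"
INTERVAL="${INTERVAL:-5}"

probe() {
    # "up" if the endpoint answers within 3 seconds, "down" otherwise
    if curl -fsS --max-time 3 "$URL" >/dev/null 2>&1; then
        echo up
    else
        echo down
    fi
}

watch_loop() {
    state=up
    while :; do
        new=$(probe)
        if [ "$new" != "$state" ]; then
            echo "$(date -u +%FT%TZ) $state -> $new"
            state=$new
        fi
        sleep "$INTERVAL"
    done
}

# Uncomment to start watching:
# watch_loop
```

Running this before “releasing the kraken” gives you a clean log with the exact time the stack went down and came back up.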

Finally, after 19 minutes of downtime, the whole partners, demos and trainings infrastructure had been reinstalled and restored from the latest backups onto from-scratch servers in another data center, using only scripts, with nothing done manually and without me.
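The scripted recovery itself boils down to a handful of ordered steps. Here is a skeleton of how such an orchestration can be timed and logged; the step bodies are placeholders (`true`) standing in for the real provisioning and restore scripts, whose names are not shown here:

```shell
#!/bin/sh
# drp-restore.sh -- skeleton of a scripted recovery with timing.
# Each step body is a placeholder; a real DRP would call its own
# provisioning/restore scripts instead of `true`.
set -eu

step() {
    # log the step name, then run the given command
    echo "==> $1"
    shift
    "$@"
}

start=$(date +%s)

step "Provision from-scratch servers in the fallback data center" true
step "Reinstall the stacks from configuration management"         true
step "Restore the latest backups"                                 true
step "Point DNS at the new servers"                               true

end=$(date +%s)
echo "Recovered in $(( (end - start) / 60 )) min $(( (end - start) % 60 )) s"
```

The final line is what turns a stressful incident into a number you can put in a postmortem, like our 19 minutes.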

GG Team !


What’s next?

For a first test, we were pretty satisfied with the result.

Even if we only tested a part of the infrastructure, the procedures and the scripts stay the same.

Knowing what we know, if we had to reinstall and restore a complete Toucan Toco infrastructure from nothing (all client stacks, private and public services, CI/CD, monitoring…), it would take us less than 2 hours.

For a small team like ours, that’s not too bad: we know we’re ready, and the Toucan Toco team is fully autonomous when facing a disaster without me.

We now need to improve our DRP, but we know it’s a never-ending job.

Because we did the test under real conditions, we know which parts of our monitoring, logs, scripts and documentation we need to change.

Since that day, we learned where to focus:

So finally what to do next?

I will enjoy my holidays (because I’m French) and plan another test to confirm we’ve become better :)
