Cliniko Outage

Matthew Jones·10 January, 2018

On Monday, January 8, between 10:05AM and 10:28AM, Cliniko experienced an outage. All times are in UTC+11.

First up, we’d like to apologise for the interruption. We take issues like these very seriously. We strongly believe in transparency and that’s why we’re sharing information about this incident, how we managed it and what we are doing to prevent similar events in the future.

Background

Before we get into it, we’d like to explain some of the terms we will be using.

Cache

Cliniko stores frequently accessed data in caches. So what is a cache? A cache keeps a copy of data close at hand so requests can be served faster than if we queried the storage on the server every time. Think of it as remembering a phone number, rather than looking it up in a phonebook every time you need it.
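
To make the phonebook analogy concrete, here is a minimal sketch of a read-through cache in Python. It is only an illustration: the class name, the `phonebook` dictionary and the hit/miss counters are made up for this example and are not how Cliniko's database cache is implemented.

```python
class ReadThroughCache:
    """Remembers values after the first lookup (illustrative only)."""

    def __init__(self, load):
        self._load = load    # the slow lookup, e.g. reading from storage
        self._store = {}     # the remembered values
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._store:
            self.hits += 1           # remembered: no slow lookup needed
            return self._store[key]
        self.misses += 1             # not remembered: look it up and keep it
        value = self._load(key)
        self._store[key] = value
        return value

    def clear(self):
        self._store.clear()          # forget everything


# Hypothetical data standing in for the slow storage.
phonebook = {"alice": "555-0100", "bob": "555-0199"}
cache = ReadThroughCache(phonebook.__getitem__)

cache.get("alice")   # first request: a miss, goes to the phonebook
cache.get("alice")   # second request: a hit, served from memory
```

After these two calls the cache has recorded one miss and one hit. Only the first request paid the cost of the slow lookup.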

System CPU spikes

CPUs (central processing units) are the brains of a computer. They perform all the tasks required to operate the apps on our smartphones and Cliniko in the cloud. These tasks can be roughly divided into two groups, system and user:

User tasks are the things that we can run: browsers, apps, databases, games, etc.
System tasks are all the things that are required to run and manage the system and its resources (memory, running more tasks, saving and retrieving things from storage).

So when the system CPU spikes, the CPU is processing intensive tasks that are required to run the system.
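
The user/system split is something most operating systems report directly. As a rough sketch, Python's standard `os.times()` returns the CPU time the current process has spent on each side. The helper `cpu_split` and the workload `user_heavy` are hypothetical names for this illustration. This measures a single process, not a whole server, and is not the monitoring we used during the incident.

```python
import os


def cpu_split(work):
    """Run work() and return (user_seconds, system_seconds) it consumed."""
    before = os.times()
    work()
    after = os.times()
    return after.user - before.user, after.system - before.system


def user_heavy():
    # Pure computation: almost all of the time is spent in user mode.
    sum(i * i for i in range(200_000))


user_s, system_s = cpu_split(user_heavy)
print(f"user: {user_s:.3f}s, system: {system_s:.3f}s")
```

A workload that leans on the operating system, such as heavy memory management or storage access, would shift more of the total towards the system figure.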

What went wrong?

On Sunday, January 7 between 04:19AM and 06:36AM, we noticed some slower-than-normal response times on Cliniko. This was a result of intermittent system CPU spikes on the primary database server, which delayed some requests. We reduced the system CPU load on the database server and raised a support query with our service provider for more information.

On Monday, January 8 at 07:31AM, we noticed similar issues to those experienced on the previous day and closely monitored the primary database server’s system CPU usage and Cliniko’s response times. The frequency of system CPU spikes was increasing and at 09:48AM, we escalated the support query with our service provider. We reevaluated the findings and speculated that the intermittent system CPU spikes seen on the database server were most likely a result of the recent Meltdown/Spectre patches applied by the service provider, which can result in increased CPU usage. A potential workaround for the system CPU spikes was suggested.

At approx. 10:05AM, the workaround was executed, clearing the database server’s cache. This meant that the database didn’t have the data it required in memory and had to fetch it from storage. This immediately increased Cliniko’s response times and by 10:07AM, Cliniko was unresponsive to the majority of requests. At this time, we discussed entering maintenance mode to allow the database to catch up with the number of requests and give it time to start filling the cache back up.
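
The effect of clearing a warm cache can be sketched with a small simulation. It counts how many requests have to go to storage before and after the cache is cleared. The workload, the row names and the `serve` helper are invented for illustration and do not reflect Cliniko's real traffic or database.

```python
from itertools import cycle, islice


def serve(requests, cache, storage):
    """Serve requests and return how many had to be read from storage."""
    storage_reads = 0
    for key in requests:
        if key not in cache:
            cache[key] = storage[key]   # slow path: fetch and remember
            storage_reads += 1
    return storage_reads


# Hypothetical storage of 100 rows, each requested three times per round.
storage = {f"row{i}": i for i in range(100)}
cache = {}
workload = list(islice(cycle(storage), 300))

warm_up = serve(workload, cache, storage)   # fills the cache
steady = serve(workload, cache, storage)    # everything served from memory
cache.clear()                               # the workaround: cache emptied
after_clear = serve(workload[:100], cache, storage)

print(warm_up, steady, after_clear)         # prints "100 0 100"
```

In the steady state no request touches storage. Straight after the clear, every one of the next 100 requests does, which is why response times jumped while the cache was refilling.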

The workaround was executed before we understood its full impact on Cliniko.

Recovery

The decision was made at 10:14AM to place Cliniko into maintenance mode and by 10:19AM Cliniko was in maintenance mode. We monitored the database closely and at 10:21AM the database was stable. Cliniko was brought out of maintenance mode at 10:23AM and was fully available by 10:28AM.

Where do we go from here?

1. Understand the impact of changes before executing them.
2. Investigate other potential workarounds for intermittent system CPU spikes.

Again, we would like to apologise for the interruption caused. We always use problems like this as a chance to improve and learn, in order to reduce the likelihood of these incidents occurring again.