As many of our customers are aware, we had a significant outage yesterday. Luckily it didn’t affect all of our customers, but that was certainly no consolation for those it did.
So what actually happened? First I need to explain a little about DNS, but I will keep it very simple. We own the domain name “cliniko.com”, and we do quite a few things with it:
- http://www.cliniko.com and http://cliniko.com point to our website
- https://*.cliniko.com points to the software (where * is your site address)
- http://support.cliniko.com points to our support site
- @cliniko.com is used for our email addresses (e.g. support@cliniko.com)
We have a DNS (Domain Name System) provider that receives all requests for cliniko.com and sends them to the right place. That is the DNS component of our infrastructure. Cliniko itself is hosted separately, and the domain name registration for cliniko.com is separate again; DNS is the thing that joins them together.
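The routing described above can be sketched as a lookup table with a wildcard fallback. This is only a toy model to illustrate the idea, not how DNS actually works internally, and the destination names are placeholders, not Cliniko’s real servers:

```python
# Toy model of the DNS routing described above: exact-match records
# plus a wildcard ("*") fallback for customer subdomains.
# Destination values are illustrative placeholders only.
RECORDS = {
    "cliniko.com": "website-server",
    "www.cliniko.com": "website-server",
    "support.cliniko.com": "support-site",
    "*.cliniko.com": "application-servers",  # e.g. yourclinic.cliniko.com
}

def resolve(hostname: str) -> str:
    """Return the destination for a hostname: exact match first,
    then fall back to the wildcard record for the parent domain."""
    if hostname in RECORDS:
        return RECORDS[hostname]
    # Replace the leftmost label with "*" and try the wildcard record.
    _, _, parent = hostname.partition(".")
    wildcard = "*." + parent
    if wildcard in RECORDS:
        return RECORDS[wildcard]
    raise KeyError(f"no record for {hostname}")

print(resolve("www.cliniko.com"))         # website-server
print(resolve("yourclinic.cliniko.com"))  # application-servers
```

When the provider holding this table goes down, every one of those destinations becomes unreachable by name at once, which is why the website, the software and the support site all disappeared together.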
Yesterday our DNS provider (Zerigo) suffered a very large DDoS (distributed denial of service) attack. In a DDoS attack, someone malicious floods a server with so many requests that it becomes too busy to respond to normal traffic. It is not a security risk: DDoS attacks are purely disruptive, and our DNS provider has nothing to do with the actual Cliniko software itself.
Not all of Zerigo’s servers were attacked, and they allocate customers to servers somewhat randomly. Those allocated to the attacked servers are the ones that experienced the problems.
The result was that for many people, the Cliniko software, the Cliniko website and our support site were all unavailable. This is obviously a terrible result.
We had been very happy with Zerigo for the last 18 months, with no known outages. This outage however was unacceptable for a couple of reasons:
- It was way too long. We expect our providers to have appropriate disaster recovery and redundancy measures in place for when events like this happen (and they will happen).
- The communication was very poor. We immediately phoned, emailed and contacted them on Twitter. We still haven’t received a response to any of our requests; we have been at the mercy of their public status updates (http://zerigostatus.com/), which have been few and far between.
So what did we do yesterday to handle this? We were first notified of the issue at around 1pm AEST. We immediately investigated and found Zerigo was having issues (not via their status page, which hadn’t been updated yet, but via the numerous people messaging them on Twitter with the same complaint). Soon after, Zerigo did update their status page to indicate they were aware of the issue and were working on resolving it. At this stage we felt our best plan was to wait for them to resolve it, expecting them to be able to do so very quickly. Our other option was to move to a new provider, but that is risky without proper planning and time for testing.
We made sure we were extremely available to our customers. We know how bad it is when Cliniko isn’t working, and even if we couldn’t fix the problem, we wanted to be very open about what was happening and very accessible. We responded to almost every request within one minute. Requests came via email, our support site, telephone, Twitter and Facebook (and there were many).
At about 5pm AEST, we couldn’t wait any longer for Zerigo. We immediately began setting up our DNS with another provider, and by 5:30pm this was completed and we had updated cliniko.com to point to the new DNS provider. This wasn’t our preferred resolution, due to the risk of an unplanned change, but at that point regaining control over our DNS was the number one priority. Once this was done, we were in the frustrating position of waiting for the change to propagate through the internet (this relies on internet service providers picking up the new settings). The problem was fixed for most users by around 10pm AEST, though for some it wasn’t resolved until 5am AEST the next morning.
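The propagation delay comes from caching: resolvers around the internet keep serving the old answer until its TTL (time to live) expires, so a change only becomes visible as those caches refresh, at different times for different users. A minimal sketch of that behaviour, with made-up IPs and a made-up TTL:

```python
import time

class CachingResolver:
    """Toy resolver that caches answers for their TTL, illustrating
    why a DNS change takes time to propagate. The IPs and the TTL
    here are made-up examples."""

    def __init__(self, clock=time.time):
        self.clock = clock
        self.cache = {}  # hostname -> (answer, expires_at)

    def resolve(self, hostname, authoritative_lookup, ttl=3600):
        answer, expires_at = self.cache.get(hostname, (None, 0.0))
        if self.clock() < expires_at:
            return answer  # still serving the cached (possibly stale) answer
        answer = authoritative_lookup(hostname)
        self.cache[hostname] = (answer, self.clock() + ttl)
        return answer

# Simulate the changeover with a fake clock.
now = [0.0]
resolver = CachingResolver(clock=lambda: now[0])

old_provider = lambda host: "203.0.113.1"   # answer before the switch
new_provider = lambda host: "198.51.100.7"  # answer after the switch

print(resolver.resolve("cliniko.com", old_provider))  # caches the old IP
print(resolver.resolve("cliniko.com", new_provider))  # still the old IP: cache valid
now[0] = 4000  # the 3600-second TTL has expired
print(resolver.resolve("cliniko.com", new_provider))  # now picks up the new IP
```

A resolver that happened to query just before the switch keeps the stale answer for a full TTL, which is why different users saw the fix hours apart.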
We provided full support and communication through all channels until approximately 1:30am AEST, then limited support via Facebook only between 1:30am AEST and 7:30am AEST, when we resumed full support.
Looking back at this event, and thinking about what we could have done differently:
- We should have changed providers as soon as we saw that the communication was poor and no resolution was forthcoming (about 1 hour after the trouble started, instead of 4 hours after). There was nothing to lose with the changeover, so we should have done it sooner.
- We need to make sure every technology service we use has a strong history of reacting well in the event of failure. We hadn’t done enough evaluation of our DNS provider because it felt like such a small and simple part of the chain. We were obviously very wrong there. We have now gone with a provider whose history we have seen, whose customer reviews we have read, and whom we have spoken to personally (their support has been great already).
Although the outage was caused by a third-party provider, they were a provider we selected. This post is not at all about shifting blame; we take full responsibility for the outage. We are the ones committed to having Cliniko available whenever you need it, and we let you down on that front. We will take some positives out of this and also look at what else we could do to reduce the impact of an outage if one occurs again (and I do expect it will, but our recovery should be much, much faster).
Some of our users mentioned this as a drawback of online systems (and I don’t blame them). Really, though, I think this can happen to any system (online, offline or paper). If you are running a local system and your server dies, you are without your system, you may have lost data, and you need to fix it at your own expense. With us, there is no loss of data (we have so many backups!) and you have a team of experts hustling immediately to get it fixed without any effort or cost on your side. This doesn’t for a second make us second-guess an online system; in fact I think it highlights some of the advantages (also, SMS messages and emails still went out as normal).
Also, not to finish on all doom and gloom, here are our uptimes for the past 12 weeks (oldest to newest):
99.84% — 99.96% — 100% — 99.79% — 99.98% — 99.74% — 99.91% — 99.99% — 98.84% — 100% — 99.98% — 100%
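To put those percentages in context, a week has 7 × 24 × 60 = 10,080 minutes, so a weekly uptime figure converts to downtime minutes like this:

```python
WEEK_MINUTES = 7 * 24 * 60  # 10,080 minutes in a week

def weekly_downtime_minutes(uptime_percent: float) -> float:
    """Convert a weekly uptime percentage into minutes of downtime."""
    return (100.0 - uptime_percent) / 100.0 * WEEK_MINUTES

# Two of the weeks from the figures above:
print(round(weekly_downtime_minutes(99.99), 1))  # ~1.0 minute of downtime
print(round(weekly_downtime_minutes(98.84), 1))  # ~116.9 minutes (the outage week)
```

So 98.84% for the outage week works out to roughly two hours of downtime, while a typical week is measured in single-digit minutes.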
I hope this explanation gives you some insight into what went on yesterday. We are committed to being open and transparent about Cliniko, so we have put this out there for our customers, and also potential customers, to see.
Thank you for your understanding and again… we’re sorry!