[Dev] 28/11/16 Outage Report



Jin
28-11-2016, 11:11 PM
On 28th November, between 22:45 and 23:03, we suffered a brief outage, the cause of which was discovered within 10 minutes.

What happened?
The new servers we migrated to were originally set up 3 weeks ago. Back then there were two database servers replicating from each other, the idea being that we could offer high availability and load balancing between the two. However, a week before we began the migration we were performing some tuning steps which broke the replication between host 1 and host 2.

We thought nothing of it and shut down the second host, on the basis that we would go back to repair it over the Christmas holidays.

Whilst the sites continued to operate on the single database host, this host was generating binary log files which were not being processed by the second host, as it was shut down. As a result, the log files slowly filled the C:\ drive to 100%, which is what caused the outage.
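
To illustrate how this kind of slow fill can be spotted ahead of time, here is a minimal sketch that adds up the size of the log files in a directory and estimates how long the remaining free space would last. The file name pattern `*-bin.*` and the function names are hypothetical and not taken from our setup, and the estimate assumes the logs grow at a constant rate.

```python
from pathlib import Path


def total_log_bytes(log_dir, pattern="*-bin.*"):
    """Sum the sizes of the files in log_dir whose names match pattern.

    The default pattern is a hypothetical naming scheme for binary logs.
    """
    return sum(p.stat().st_size for p in Path(log_dir).glob(pattern) if p.is_file())


def days_until_full(free_bytes, growth_bytes_per_day):
    """Estimate the days left before free space runs out at a constant growth rate."""
    if growth_bytes_per_day <= 0:
        return float("inf")  # no growth, so the disk never fills
    return free_bytes / growth_bytes_per_day
```

Sampling `total_log_bytes` once a day and feeding the difference into `days_until_full` would give a rough warning well before a drive reaches 100%.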

Why were the sites suspended?

Whilst attempting to fix the issue we were faced with max connection errors, as the slowed-down server clung onto each connection for several minutes whilst it slowly attempted to process the request. In the end, user traffic had to be stopped from reaching the database server so the fix could be applied.
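
As a rough illustration of why slow requests lead to max connection errors, the sketch below applies Little's law: the average number of connections in use is roughly the request arrival rate multiplied by how long each request holds its connection. The figures in the example are made up for illustration and are not measurements from the outage.

```python
def connections_in_use(requests_per_second, seconds_per_request):
    """Average number of open connections, by Little's law (L = lambda * W)."""
    return requests_per_second * seconds_per_request


def exceeds_limit(requests_per_second, seconds_per_request, max_connections):
    """True if the steady-state connection count would pass the configured limit."""
    return connections_in_use(requests_per_second, seconds_per_request) > max_connections


# Illustrative numbers only: 10 requests/s against a limit of 150 connections.
healthy = exceeds_limit(10, 0.5, 150)   # each request holds a connection for 0.5 s
degraded = exceeds_limit(10, 180, 150)  # each request holds a connection for 3 minutes
```

With the same traffic, a request that takes minutes instead of a fraction of a second multiplies the number of open connections by the same factor, which is why the limit is reached so quickly.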

How are we going to prevent this from happening?

There are a number of things we still have left to do with our new infrastructure, the first and foremost being our new monitoring platform, which will monitor for things such as disk space utilisation, memory usage, processor usage and service uptime.
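
As an example of the simplest of these checks, here is a minimal sketch of a disk space utilisation check using Python's standard `shutil.disk_usage`. It is not the monitoring platform itself; the 90% threshold and the function names are assumptions chosen for illustration.

```python
import shutil


def percent_used(total_bytes, used_bytes):
    """Return disk utilisation as a percentage of total capacity."""
    return 100.0 * used_bytes / total_bytes


def check_disk(path, threshold_percent=90.0):
    """Return (percent used, alert flag) for the volume containing path.

    The default threshold of 90% is an assumed value, not a recommendation.
    """
    usage = shutil.disk_usage(path)
    pct = percent_used(usage.total, usage.used)
    return pct, pct >= threshold_percent
```

Calling `check_disk("C:\\")` on a schedule and raising an alert whenever the flag is true would have flagged the drive filling up before it reached 100%.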

As we migrated over in a hurry for commercial purposes, we have had to prioritise the web and database servers over monitoring. We are slowly catching up on these tasks.
