Jin
16-03-2011, 04:13 AM
So first of all, huge apologies for the very, very long downtime, which lasted about 84 hours. An explanation is due, and I'll try to keep it as simple as I can so everybody understands. Sorry for any spelling and grammar mistakes; I haven't had much sleep since these issues began.
DC = datacenter (where our server is located)
Host = the middle man between us and the datacenter
Disk = hard drive
12th March
15:06: Work begins moving services running on our server to a new disk, to relieve read/write wait times on our main disk.
16:06: Data found to be silently corrupting on our main disk (we monitor for corruption; a sketch of the monitoring idea follows this day's log).
16:10: Sites closed so recent backups can be taken.
17:48: Host is given orders to replace the primary disk and reinstall the OS.
18:46: After much discussion, Host finally agrees that a disk replacement is required.
19:33: Final check of services is made on the server and the go-ahead is given to Host to begin work.
22:00: Confirmation given that the job has been passed to the DC.
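For anyone wondering what "we monitor for corruption" means in practice: the basic idea is a checksum manifest. Hash every file that shouldn't change, store the digests, and re-hash on a schedule; any digest that drifts without a deliberate change is a red flag. Below is a minimal sketch of that approach in Python. The paths and watch list are made up for illustration, and this is the general technique rather than our exact tooling:

```python
import hashlib
import json
import os

MANIFEST = "/var/lib/corruption-check/manifest.json"  # hypothetical path
WATCH_DIRS = ["/var/www"]                             # hypothetical watch list

def sha256_of(path):
    """Hash a file in chunks so large files don't exhaust memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def scan():
    """Return {path: digest} for every regular file under the watch dirs."""
    digests = {}
    for top in WATCH_DIRS:
        for root, _dirs, files in os.walk(top):
            for name in files:
                path = os.path.join(root, name)
                if os.path.isfile(path):
                    digests[path] = sha256_of(path)
    return digests

def check():
    """Compare the current scan against the stored manifest and report drift."""
    current = scan()
    if not os.path.exists(MANIFEST):
        os.makedirs(os.path.dirname(MANIFEST), exist_ok=True)
        with open(MANIFEST, "w") as f:
            json.dump(current, f)
        print("Baseline manifest created; nothing to compare yet.")
        return
    with open(MANIFEST) as f:
        baseline = json.load(f)
    for path, digest in current.items():
        if path in baseline and baseline[path] != digest:
            print(f"CHANGED (possible corruption): {path}")

if __name__ == "__main__":
    check()
```

Files that legitimately change need to be excluded (or the manifest refreshed after deploys), otherwise you drown in false alarms.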
13th March
01:18: Disk is replaced and the server is handed back over.
01:20: Restoration of core services to the server begins.
04:13: Restoration of the Habbox websites begins.
07:23: Backups made from the original server show signs of corruption due to data having been stored on the bad sectors of the disk.
07:30: Request made to Host to restore the original disk so new backups can be made.
09:13: Confirmation that the job has been passed to the DC.
09:59: Confirmed the disk has been swapped.
10:15: Server refuses to boot correctly; investigation request made with Host.
10:49: Host confirms the request has been sent to the DC.
11:17: DC states that corruption has occurred on the boot partition of the disk.
11:25: Request made that the primary drive be swapped back to the new drive, with the old disk mounted within the system.
12:33: Host replies with a message from the DC stating that the new primary drive has also become corrupt, due to a motherboard fault which has been corrupting data.
12:45: Request made that the motherboard be replaced ASAP.
18:47: Server handed back with a new motherboard and disks, loaded as a rescue OS.
18:33: Restoration of core services begins again.
22:06: Backups are made and verified; tables show signs of corruption (a table-check sketch follows this day's log).
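Quick aside on what "verified" means here: we ask MySQL itself to check every table and flag anything that doesn't come back OK. A rough sketch of that check, using the PyMySQL client; the connection details are placeholders, not our real ones:

```python
import pymysql

# Placeholder connection details for illustration only.
conn = pymysql.connect(host="localhost", user="root", password="secret")

SYSTEM_DBS = ("information_schema", "performance_schema", "mysql")

with conn.cursor() as cur:
    cur.execute("SHOW DATABASES")
    databases = [row[0] for row in cur.fetchall() if row[0] not in SYSTEM_DBS]

    for db in databases:
        cur.execute(f"SHOW TABLES FROM `{db}`")
        for (table,) in cur.fetchall():
            # CHECK TABLE returns (Table, Op, Msg_type, Msg_text) rows;
            # anything other than "OK" in Msg_text is worth a closer look.
            cur.execute(f"CHECK TABLE `{db}`.`{table}`")
            for _tbl, _op, msg_type, msg_text in cur.fetchall():
                if msg_text != "OK":
                    print(f"{db}.{table}: {msg_type} {msg_text}")

conn.close()
```

Output like this, rather than a clean run of OKs, is what tells you a backup isn't safe to rely on.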
14th March
02:12: All tables downloaded to the local machine for repair.
04:43: All tables repaired on the local machine (see the repair sketch after this day's log).
05:02: Request made for the OS to be reloaded.
07:15: Confirmation received that the job has been passed to the DC.
Sleep
16:00: Server has been handed back.
16:19: Server IPs are incorrect compared to those previously allocated, and the OS is also different. Host notified.
17:13: Some crap story given back; we're told to update the IPs in globalDNS.
17:56: globalDNS system interface not working (nothing to do with us).
18:23: Request made to Host for the server ID number which references our machine in the DC.
20:34: After much arguing, server ID number obtained.
21:15: Phone call made to the DC and the issue transferred to a ticket under an account previously held with them.
21:25: DC agrees that mistakes have been made and that they only acted on requests from Host; work begins to restore the OS and IPs.
22:00: Repaired tables uploaded from the local machine to the cloud. Sleep.
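For the curious, the local repair at 04:43 is essentially MySQL's own REPAIR TABLE run over everything that fails a check. A minimal sketch, assuming MyISAM tables (REPAIR TABLE doesn't apply to InnoDB) and with placeholder credentials and database name:

```python
import pymysql

# Placeholder connection details and database name.
conn = pymysql.connect(host="localhost", user="root", password="secret",
                       database="habbox_db")

with conn.cursor() as cur:
    cur.execute("SHOW TABLES")
    for (table,) in cur.fetchall():
        cur.execute(f"CHECK TABLE `{table}`")
        last = cur.fetchall()[-1]   # final row carries the overall status
        if last[3] != "OK":         # Msg_text column
            print(f"Repairing {table}: {last[3]}")
            cur.execute(f"REPAIR TABLE `{table}`")
            for row in cur.fetchall():
                print("  ", row[2], row[3])  # Msg_type, Msg_text

conn.close()
```

Repair what you can, re-run the check, and anything still failing has to come from an older backup.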
15th March
07:00: Server handed back from the DC.
07:15: Restoration of core services begins.
12:12: Core services completed.
12:32: Backup downloads begin.
12:42: Download fails; fault traced back to an incorrectly configured network card.
13:00: DC is called. DC asks for the job to be authorized by Host.
13:04: Request made to Host.
13:35: Configuration resolved.
14:09: Backups finish downloading from the cloud.
14:48: Habboxlive restored.
15:00: Work begins on Habbox.com.
17:23: Deep corruption found in critical v5 habbox.com files in a premium modification; sierk contacted to track down the source files.
18:12: Source files couldn't be obtained; they belong to an ex-member of staff. V6 beta release uploaded instead.
18:30: Forum database restoration begins.
|  : Restoration fails, configuration changed and retry begins. (this repeated 4 times)
|  : Restoration fails, InnoDB table pool mismatch; mismatch resolved and retry begins. (this repeated 15 times)
V
03:05: Forum restored. (A sketch of that retry loop is at the end of this post.)
Power nap
07:00: Basic security hardening begins.
13:30: Forum reopens on a limited service.
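And for anyone wondering what that wall of "Restoration fails" lines looked like from my side of the keyboard, the overnight loop was roughly the sketch below: feed the dump to the mysql client, and if it fails, fix the configuration and go again. The file name, credentials, database name and sleep interval are all illustrative; this is the shape of the loop, not our actual script:

```python
import subprocess
import time

DUMP_FILE = "forum_backup.sql"  # hypothetical dump file name
MAX_ATTEMPTS = 20

def restore():
    """Pipe the dump into the mysql client; return True on success."""
    with open(DUMP_FILE, "rb") as f:
        result = subprocess.run(
            ["mysql", "-u", "root", "-psecret", "forum"],  # placeholder creds
            stdin=f,
            capture_output=True,
        )
    if result.returncode != 0:
        print("Restore failed:", result.stderr.decode(errors="replace")[:200])
        return False
    return True

for attempt in range(1, MAX_ATTEMPTS + 1):
    print(f"Attempt {attempt}...")
    if restore():
        print("Forum database restored.")
        break
    # This is the point where, each time round, the configuration (in our
    # case an InnoDB settings mismatch between the dump's source server and
    # the rebuilt one) had to be fixed by hand before retrying.
    time.sleep(5)
else:
    print(f"Gave up after {MAX_ATTEMPTS} attempts.")
```

Nineteen failed laps of that loop later, the forum came back at 03:05.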