Thursday 9 July 2015

What to do if all your IT goes up in flames

Interesting session today about business continuity and disaster recovery (BC/DR), focusing on the aftermath of the Crowmarsh fire. It covered some theory and best practice around BC and DR, but I'll just focus on the DR after the fire. A slightly scary story from the IT Manager for two small local authorities in South Oxfordshire.

The main services provided by the LAs were waste collection, planning and building control, housing, food safety, council tax collection and benefit payments.

All services are shared between the two councils, with all staff based in one place in Crowmarsh. They had recently relocated there, leaving a property which they now lease to Oxford City Council.
The IT department supported 440 users. One main data centre with remote backup. Mostly on-premise applications. Onsite backups were disk-to-disk-to-tape. Most servers virtualised using VMware. Most data stored on SAN technology. Some physical servers.

On Thursday 15 January 2015 there was a big fire at their location, requiring 27 fire crews. It raged from 0330 till late afternoon, then reignited at night. A car loaded with gas bottles had been used in an arson attack. They effectively lost the whole building.

The story for the IT Manager....

Call from building manager at 3.30am to say building alight.
Took on the role of raising the senior management board and then initiating the emergency plan.
Initiated the IT DR plan. Called suppliers to get the plan started and equipment delivered. Had a contract with a company for hot standby, with about a 4-hour lead time.
Made the decision to use the building they had recently moved out of (Abbey House), where there had been a data centre. That had been identified in the plan.
Contacted BT to get numbers rerouted. That was also in their plan. Calls were redirected to the switchboard in another building.
Contacted the IT team and relocated them to Abbey House.

A crisis management team had been set up, including senior management, IT, HR, comms and members of the emergency services. First meeting at 7am.
The building had a police cordon because no one knew why it had been attacked or whether other buildings would be targeted.
Backup tapes were needed, but there was only one key to the safe, and it was on someone's desk which had been destroyed in the fire! Rang a locksmith to break into the safe. Had the tapes before the equipment had turned up.

Existing infrastructure in the building configured for use.
DR plans checked so people knew what to do.
Equipment delivered to remote site by 11am
Set up the equipment and rebuilt the restore server.
Initial run of the backup tapes found problems with the tape drive. Thought the backup tapes were damaged :-(
Wasted several hours, but it turned out they had been sent the wrong type of tape drive, so had to get a new one. Lost a day.
Shared bandwidth with Oxford City Council to get access to the Internet and set up some temporary websites.

7pm, sent everyone home to rest.

Friday

New tape drive delivered and restores started.
Contacted key suppliers and asked for help where needed. Suppliers offered engineers etc. Accepted all help.
Emergency laptops purchased to get frontline staff working. Housing staff were the priority as they dealt with vulnerable people and used a hosted application. Bought staff mobile phones.
A lot of laptops and mobiles had been lost in the fire.
Old XP machines used for temporary desktops.
Put up temporary websites to give the public information and the key things they needed.

Saturday

Restore of key systems and databases fully underway. There was an issue getting Active Directory back.
Had regular meetings of the team throughout the process.
Migrated mail to Office 365. Needed to get mail working and was about to do it anyway, so already had the licences. Scripted the process to automate user creation (a rough sketch of that sort of script is below). Within 2 hours had a fully functioning email system, through a browser. Didn't restore the legacy system.
Building laptop image for use on Monday.
Live websites were back up by Sunday night.
Over the weekend there was an issue with available storage space for the VM farm.
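
As an aside on the scripted Office 365 user creation: at the time this would most likely have been done with the Microsoft admin tooling of the day, and I don't know what the council actually ran. Purely to illustrate the idea, here is a minimal Python sketch that bulk-creates users from a CSV export via the Microsoft Graph API. The endpoint is real, but the file name, CSV columns, domain and token handling are all placeholder assumptions.

```python
# Illustrative sketch only - not the council's actual script.
# Bulk-create Office 365 / Azure AD users from a CSV export using Microsoft Graph.
# Assumes an app registration with User.ReadWrite.All and a pre-obtained OAuth token.
import csv
import requests

GRAPH_USERS_URL = "https://graph.microsoft.com/v1.0/users"
ACCESS_TOKEN = "..."           # client-credentials token acquisition not shown
DOMAIN = "example.gov.uk"      # placeholder tenant domain

def create_user(display_name: str, alias: str) -> None:
    """Create a single cloud-only user with a temporary password
    (licence assignment to enable the mailbox is a separate step)."""
    payload = {
        "accountEnabled": True,
        "displayName": display_name,
        "mailNickname": alias,
        "userPrincipalName": f"{alias}@{DOMAIN}",
        "passwordProfile": {
            "forceChangePasswordNextSignIn": True,
            "password": "Chang3-Me-Please!",   # generate per user in real use
        },
    }
    resp = requests.post(
        GRAPH_USERS_URL,
        json=payload,
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()

# users.csv is assumed to have 'displayName' and 'alias' columns.
with open("users.csv", newline="") as f:
    for row in csv.DictReader(f):
        create_user(row["displayName"], row["alias"])
        print(f"Created {row['alias']}@{DOMAIN}")
```

Looping over a simple export like this is how a couple of hundred accounts can appear in a couple of hours, which matches the timescale they described.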

Monday onwards

Limited number of desks, so only limited space; had to improvise!
Initiated a VDI project to replace desktops. Had a new desktop within a week. A week!!!
All system and data recovery completed by Wednesday.
The challenge then was to get the rest of the business working.
Buy, buy, buy; build, build, build.
Bought new laptops, thin clients and replacement physical servers. Built new desktops and laptops.
Issues
  • Limited office space, so staff had to work in rotas.
  • People's expectations about how long it would take to recover; the assumption was that it would just happen.
  • Out-of-hours support for emergency changes; some providers didn't offer it.
Challenges
  • Needed to minimise impact on services to the public
  • Delivering major elections, the largest set of elections for 30 years.
  • Office accommodation for staff. Leased some space and brought some old buildings back into play.
  • Communications to staff on what was happening, regular briefings in local town hall.
Where are they now?
  • Just moved into new offices
  • Still running on DR equipment.
  • Impact on projects, lots of delays

Lessons
  • Test your plans
  • Be prepared to change plans, e.g. the move to Office 365
  • Never assume anything, like there's a spare key!

Other points
They were out of the news after 2 days, so consider that a success.
A £25m insurance bill!
They showed a video of the damage; a lot of it was done not by fire but by water and smoke.
Had to run a full data destruction process on all the damaged kit.

Really great case study.

