This is something that many traditional IT admins look for, because they always had to have a plan to keep their company's hardware, software and network up and running. That included a plan for how to react if one of the components failed or crashed.
With cloud hosting, many of those paradigms have changed, so we should first look at the components that make up the infrastructure.
Components of the infrastructure
- Hardware
  - All the hardware is provided by the ISP
  - It is neither owned nor exclusively utilised by us
- Network
  - DNS system
  - Routing system
- Services
- Application
  - Developed and maintained e.g. by a Drupal agency
- Customer data
  - Customer files (images, PDFs, etc.)
What can go wrong?
The first two components (hardware and network) are maintained by an ISP (AWS, Google, Linode, JiffyBox, etc.), and what we know as the host is no longer a bare metal box where something could break; rather, it is a virtual server inside a bigger infrastructure.
All the ISPs we support in this server farm monitor individual components such as power supplies, hard drives, main boards, switches and routers on an ongoing basis, and they replace them early enough that a hardware crash is hardly something we have to consider. The network infrastructure, with the global DNS system and the routing, is such a critical component of the global net that we rely on the experts in those areas to keep it up and running for us; there is nothing we can do to reduce the risk of a (temporary) failure.
With regard to the third part, services, their installation and configuration is the result of a fully automated process, and no manual interaction ever takes place. A failure is therefore very unlikely, and before we change services or their configuration, we always test the change on staging hosts.
The application layer is provided by an agency, which usually runs a version control system. Of course, such an agency should never deploy fresh code to a live server without having done everything possible to make sure that the code doesn't break anything. If it happens anyway, the agency has to redeploy a previously working version of its application.
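As an illustration of "redeploy a previously working version", here is a minimal sketch of how a rollback target could be picked from a deployment history. The `Release` type, its `healthy` flag and the tag names are hypothetical and not part of any agency's actual tooling; the sketch only assumes that each deployed version is tagged in version control and that someone records whether it ran without problems.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Release:
    tag: str       # hypothetical version-control tag of a deployed release
    healthy: bool  # assumption: True if the release ran on live without problems


def rollback_target(history: Sequence[Release]) -> Optional[Release]:
    """Return the most recent healthy release before the latest deployment.

    `history` is ordered from oldest to newest; the last entry is the
    deployment that is currently live (and presumably broken).
    """
    for release in reversed(history[:-1]):
        if release.healthy:
            return release
    return None  # no earlier working version is known
```

For a history of `v1.0` (healthy), `v1.1` (healthy) and `v1.2` (broken), the function returns the `v1.1` release, which the agency would then check out and deploy again.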
That brings us to the fifth and last part, the customer data. Let's ask ourselves what could go wrong with that data: it can be overwritten or deleted unintentionally. This is only possible through an error in the application provided by the agency or through a user mistake. As all customer data is subject to our backup strategy, such files can always be restored, and the damage is limited or even avoided completely.
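To make the restore step concrete, the following is a minimal sketch of restoring a single customer file. It assumes, purely for illustration, that the backup is available as a directory tree mirroring the live file tree; the function name and both directory parameters are hypothetical and do not describe the actual backup tooling.

```python
import shutil
from pathlib import Path


def restore_file(backup_root: Path, live_root: Path, relative_path: str) -> Path:
    """Copy one file from the backup tree back into the live tree.

    Assumption: `backup_root` mirrors the directory layout of `live_root`.
    """
    source = backup_root / relative_path
    target = live_root / relative_path
    if not source.is_file():
        raise FileNotFoundError(f"no backup copy of {relative_path}")
    # Recreate missing parent directories, e.g. after an accidental delete.
    target.parent.mkdir(parents=True, exist_ok=True)
    # copy2 also preserves timestamps and overwrites a damaged live copy.
    shutil.copy2(source, target)
    return target
```

The same call covers both failure modes named above: a deleted file is recreated, and an overwritten file is replaced by its backup copy.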
How to recover from one of those failures
With all that in mind, it is hard to see which scenarios would lead to a disaster that we would have to recover from. In the very worst case, there might be the following three scenarios:
- Loss or damage of customer files due to application error or user mistake: we recover from that by restoring the data.
- Corrupted services configuration due to an error made by a DevOps engineer: in this case we simply rebuild that host from scratch and then restore the customer files.
- Data center unavailable: if this is temporary, it should usually return to normal operation much quicker than any action we could possibly take. However, if the data center is completely broken (technically or financially), then we have to switch to a different data center and build our hosts there from scratch as well.
How do we build hosts from scratch? This is the same as adding a new host and is fully documented in that chapter.
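The three scenarios above can be summarised as a small decision table. The sketch below is only an illustration of that mapping; the enum, the function name and the step wording are hypothetical and not part of the actual tooling.

```python
from enum import Enum
from typing import List


class Scenario(Enum):
    CUSTOMER_FILES_DAMAGED = "customer files lost or damaged"
    SERVICES_CONFIG_CORRUPTED = "services configuration corrupted"
    DATA_CENTER_UNAVAILABLE = "data center unavailable"


def recovery_steps(scenario: Scenario, outage_is_permanent: bool = False) -> List[str]:
    """Return the ordered recovery steps for one of the three scenarios."""
    if scenario is Scenario.CUSTOMER_FILES_DAMAGED:
        return ["restore customer files from backup"]
    if scenario is Scenario.SERVICES_CONFIG_CORRUPTED:
        return ["rebuild host from scratch", "restore customer files from backup"]
    if scenario is Scenario.DATA_CENTER_UNAVAILABLE:
        if not outage_is_permanent:
            # A temporary outage usually ends before we could act ourselves.
            return ["wait for the ISP to restore normal operation"]
        return [
            "switch to a different data center",
            "rebuild hosts from scratch",
            # Assumption: files are restored as in the rebuild scenario.
            "restore customer files from backup",
        ]
    raise ValueError(f"unknown scenario: {scenario!r}")
```

Calling `recovery_steps(Scenario.DATA_CENTER_UNAVAILABLE, outage_is_permanent=True)` yields the longest path: switch data center, rebuild the hosts, then restore the files.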