Backup and restore is a hot topic at the moment, since many companies are finally deploying SharePoint technologies globally. Many had purchased their licenses but had not put them to use – this is changing.
For any large global corporation, deploying a new product is a challenge full of risks (the devil is in the details) that requires much thought, planning and testing. Think about it for a minute: thousands of users, limited data center resources, a new product, lots of unknowns – operations people generally hate this sort of thing.
So what factors influence your backup/restore plan? Well, here are a few items to think about:
- Service Level Agreements (SLA) – what is the agreed-upon availability? Should the system become unavailable, what’s the agreed-upon time for restoration of service? Who owns the process?
- Backup and restore speed – What is the tested backup speed given the reality of your network and storage systems?
- Site collection size – what is the size of the site collection(s)? What is the projected time for backup/restore? Keeping site collection size under control helps manage this. You might want to keep higher priority sites in a separate site collection.
- Have you externalized content? If so, you must sequence your backups accordingly: content databases first, then BLOBs.
- The steps involved – what exact steps are involved in backup and restore?
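The projected backup/restore time mentioned above is simple arithmetic: size divided by tested throughput, compared against the SLA window. Here is a minimal sketch of that check. The names (`SiteCollection`, `estimated_hours`, `exceeds_sla`) and all of the numbers are hypothetical and only for illustration; plug in the sizes and throughput you actually measured in your own environment.

```python
from dataclasses import dataclass


@dataclass
class SiteCollection:
    name: str
    size_gb: float  # measured size of the site collection (illustrative values below)


def estimated_hours(size_gb: float, throughput_gb_per_hour: float) -> float:
    """Projected duration = size / tested throughput."""
    if throughput_gb_per_hour <= 0:
        raise ValueError("throughput must be positive")
    return size_gb / throughput_gb_per_hour


def exceeds_sla(collections, restore_gb_per_hour, sla_restore_hours):
    """Names of site collections whose projected restore time exceeds the SLA window."""
    return [
        c.name
        for c in collections
        if estimated_hours(c.size_gb, restore_gb_per_hour) > sla_restore_hours
    ]


# Made-up numbers: replace with sizes and throughput measured on your network and storage.
collections = [SiteCollection("intranet", 40.0), SiteCollection("archive", 300.0)]
at_risk = exceeds_sla(collections, restore_gb_per_hour=50.0, sla_restore_hours=4.0)
print(at_risk)
```

With these made-up figures, the 300 GB collection projects to six hours against a four-hour window, which is the kind of result that argues for splitting it up or moving high priority sites into their own site collection.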
I believe the key to a successful backup and restore process is tightly linked to the points above. For example, years ago I worked as a manager at a large hospital. One of my tasks was to clean up the mess left by the prior team, and testing and formalizing backup and restore was at the forefront. The data center guys swore up and down that they could back up and restore a system. So we put it to the test by performing an upgrade on a few servers and asking them to restore. Four days later, I had been given every excuse I’d ever heard for not being able to restore. I must admit that I did not feel sorry for them (because of the way they handled the situation) when management discovered that the enterprise backup and restore didn’t work. Having a backup plan of my own, I pulled the original drive arrays from my storage area and returned the servers to their original state.
The moral of the story is that you must test your ability to back up and restore your systems so that you can guarantee the SLA. Know exactly what must be done to recover the systems, and have it documented and tested on a regular basis (Example of documented process).
For example, document the steps for:
- Backing up and reviewing the logs for success – use the manufacturer’s recommended steps as the basis for your plan. Focus on errors, time to back up, and changes to the schedule.
- Rebuilding failed servers – use the manufacturer’s recommended steps as the basis for your plan. Think about media, updates, time to restore, patches and sequence of tasks.
- Restoring data on failed servers – use the manufacturer’s recommended steps as the basis for your plan. Adjust the plan to take into consideration the aspects of your environment.
- Testing the recovery – work with the business units to create test cases that map to the SLA. Focus on recovery time and accuracy of data.
- End user sign-off process – get sign-off from the users that the testing was successful.
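The log review step in the list above lends itself to a small automated check. The sketch below assumes you have already pulled each run's duration and error count out of whatever logs your backup tool produces; the `BackupRun` record, the job names, the baseline figures and the 25% tolerance are all hypothetical, not part of any product.

```python
from dataclasses import dataclass


@dataclass
class BackupRun:
    job: str
    duration_minutes: float
    error_count: int


def review_runs(runs, baseline_minutes, tolerance=0.25):
    """Return (job, reason) findings for runs that need a closer look."""
    findings = []
    for run in runs:
        # Any logged error means the backup cannot be assumed good.
        if run.error_count > 0:
            findings.append((run.job, "errors logged"))
        # A run much slower than its baseline may force a schedule change.
        baseline = baseline_minutes.get(run.job)
        if baseline and run.duration_minutes > baseline * (1 + tolerance):
            findings.append((run.job, "slower than baseline"))
    return findings


# Illustrative data only.
runs = [
    BackupRun("content-db", 62.0, 0),
    BackupRun("blob-store", 95.0, 0),
    BackupRun("config-db", 10.0, 2),
]
baseline = {"content-db": 60.0, "blob-store": 70.0, "config-db": 10.0}
findings = review_runs(runs, baseline)
for job, reason in findings:
    print(job, reason)
```

A check like this does not replace reading the logs or running a real restore test; it only makes it harder for a failed or steadily slowing backup to go unnoticed between tests.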
I’ve seen this used successfully in several large organizations and, having implemented it myself, I can say it helps with managing those day-to-day operations headaches.