Scott Williams - Infinity+1 @ip1
A story about a non-conventional backup and DR strategy for ConfigMgr. Four or so years ago, before there was support for things like SQL Availability Groups or "HA" site servers, a ConfigMgr 2012 environment was built. This environment would be relatively small (5000 clients),
however it would be operating in a very large organisation. Interestingly, despite what it was used for, namely building and managing servers in the datacentre, it was classified as an "incidental" service and so given no real importance. That mattered later in the discussions on DR.
It is a single stand-alone Primary with remote SQL (a requirement for all database services here) and a couple of remote DPs. So design-wise it's nothing complicated. In a DR scenario (think loss-of-a-datacentre bad) things got complicated.
We could have gone with the simple option of SQL backups: if we ever needed to DR, restore the SQL backup and recover the site. Sounds easy, except that "large organisation" thing meant other, much more critical systems would be given higher priority for restores.
When I say large org, I mean *really* large org. If we had waited our turn, it might have been 2 or more days before our stuff could be restored from backups, then another day before we could get the VMs recreated and actually get it all up and running again.
This then created a chicken-and-egg situation, because this ConfigMgr environment would be needed to build the Windows OS on those new VMs for the backups to restore to... So an alternative approach was conceived.
The solution was to use the ConfigMgr backup maintenance task to back up to a local disk on the CM server. That disk would then be backed up by the enterprise backup system (for additional/offsite/archive purposes). The ConfigMgr VM would also be replicated to the other datacentre.
So, now in a DR scenario, if we lost the primary DC we could spin up the replica VMs, copy the database files from the local CM backup over to a SQL server in the other DC, then do a site restore/recovery. No need to wait for the enterprise backup system to get around to us.
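For illustration, a minimal sketch of that manual copy step, assuming hypothetical paths (the local CM backup contains copies of the site database files under a SiteDBServer folder; the exact layout and names will vary by site):

```python
# A minimal sketch of the manual DR copy step, assuming hypothetical paths.
# The local CM backup contains copies of the site database files (.mdf/.ldf)
# under a SiteDBServer folder; folder layout will vary by site.
import shutil
from pathlib import Path

CM_BACKUP = Path(r"E:\CMBackup\SitePrimary\SiteDBServer")  # hypothetical local CM backup path
DR_SQL_DATA = Path(r"\\dr-sql01\SQLData")                  # hypothetical share on the DR SQL server

def copy_site_db_files(src: Path, dst: Path) -> None:
    """Copy the backed-up site database files to the DR SQL server."""
    for db_file in src.glob("*.?df"):  # matches .mdf and .ldf
        print(f"Copying {db_file.name} -> {dst}")
        shutil.copy2(db_file, dst / db_file.name)

if __name__ == "__main__":
    copy_site_db_files(CM_BACKUP, DR_SQL_DATA)
```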
So yes, this is all very exotic and clunky and far from ideal, but it is remarkably bulletproof. We still have "tape" backups in the event of the local backup going bad, and still have SQL backups as an additional recovery option. Lots of backup redundancy.
At some point in time, the SQL backups were changed from nightly Full backups to Full weekly with Incremental daily backups. It didn't occur to anyone that this would be an issue, until a recent incident found a small gap in our recovery plan.
What happens if you don't have a total DR situation? What if it is just an "accidental" deletion of data, say for example, an automation process gone bad? What if you don't notice the data was deleted for a day or so?
So this is where you suddenly discover gaps. First, we can't restore the previous night's backup from the CM task, because the delete happened before the backup ran, so the backup is also missing the data. What about the "tape" backup of that folder? Ah, well now...
it turns out the "tape" backup specifically excludes "database files", which is kind of fine, except nobody had told us it did that, so none of the taped copies of the CM backups had the SQL database files in them. So Plan C, the SQL backups...
Simple really: restore the last SQL full backup, then restore the Incremental backups. Except this is where we discover that when CM does its own backup of the SQL databases, it doesn't do it as a "copy-only" backup, but as a regular full backup.
What that means is each time a full backup runs it becomes the new base, and the next incremental backup is taken from that point in time. In this case, all the incremental backups were based on the nightly CM full backups, not the weekly SQL full backup.
Since the taped CM backups had the database files filtered out, we had no full backups that the incrementals could restore from. Effectively the backup restore chain had been broken.
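The break is visible in SQL Server's own backup history. A small sketch (placeholder connection string and database name) that reads msdb.dbo.backupset to show which full backup each incremental is chained to, and whether a full was taken as copy-only:

```python
# Sketch: inspect the backup chain in msdb to see which full backup each
# incremental is based on, and whether a full was taken as copy-only.
# The connection string and site database name are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=sql01;DATABASE=msdb;Trusted_Connection=yes;",
    autocommit=True,
)

sql = """
SELECT  backup_start_date,
        type,                   -- D = full, I = differential/incremental, L = log
        is_copy_only,           -- 1 = did not reset the differential base
        differential_base_guid  -- which full backup this one depends on
FROM    msdb.dbo.backupset
WHERE   database_name = ?
ORDER BY backup_start_date;
"""

for row in conn.cursor().execute(sql, "CM_P01"):  # CM_P01 is a placeholder DB name
    print(row.backup_start_date, row.type, row.is_copy_only, row.differential_base_guid)
```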
So, we have two quick fixes to remediate this situation.
1. Use AFTERBACKUP.BAT to rename the SQL database files in the CM backup to .md_ and .ld_, so they now get picked up by the "tape" backup system (it just filters on extension) - see the sketch after this list
2. Set the daily SQL backups to Full
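For fix 1, the effect of the AFTERBACKUP.BAT step is roughly the following, shown as a Python sketch with a placeholder backup path (in practice it's just a couple of rename lines in the batch file, which ConfigMgr runs after a successful backup):

```python
# Sketch of the rename performed after the CM backup completes, so the
# database file copies stop matching the "tape" system's extension-based
# exclude filter. The backup folder path is a placeholder.
from pathlib import Path

BACKUP_DIR = Path(r"E:\CMBackup\SitePrimary")  # hypothetical CM backup destination

RENAME_MAP = {".mdf": ".md_", ".ldf": ".ld_"}

for db_file in BACKUP_DIR.rglob("*"):
    new_ext = RENAME_MAP.get(db_file.suffix.lower())
    if new_ext:
        # e.g. CM_P01.mdf -> CM_P01.md_ ; CM_P01_log.ldf -> CM_P01_log.ld_
        db_file.rename(db_file.with_suffix(new_ext))
```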
So this leads to my question around modifying the ConfigMgr backup options. If the SQL backup part could be changed to a "Copy" (copy-only) rather than a "Full", then it wouldn't break the restore chain when SQL itself is doing incrementals.
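In SQL Server terms that is a copy-only backup. A sketch of taking one manually (server, database name and backup path are placeholders), just to illustrate the option in question: WITH COPY_ONLY leaves the differential base alone, so SQL's own Full + Incremental schedule keeps a valid restore chain.

```python
# Sketch: a copy-only full backup, taken outside the normal schedule.
# WITH COPY_ONLY means it does not become the new base for subsequent
# incremental backups. Server, database and path are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=sql01;Trusted_Connection=yes;",
    autocommit=True,  # BACKUP cannot run inside a transaction
)

cur = conn.cursor()
cur.execute(
    "BACKUP DATABASE CM_P01 "
    r"TO DISK = 'E:\CMBackup\CM_P01_copyonly.bak' "
    "WITH COPY_ONLY, INIT;"
)
while cur.nextset():  # drain informational messages so the backup completes
    pass
```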
Now keep in mind, this was a process developed over 4 years ago, and this is the first incident like this. In the end we were able to restore back to the last full SQL backup and only lost a couple of days of stuff, so nothing very serious
but it does kick off discussions about what can be done better. I don't actively manage this system anymore (it's owned by another team), but that may change again after this.
My first suggestion would be to set up a SQL AG, then look at HA for the Primary with the new CM features.
This would remove the need for the local CM backups, and we could then go to native SQL backups, with CD.latest etc. covered by the normal "tape" backup as well.
Since a "data loss" scenario doesn't have the issue of needing to wait for a restore, relying on SQL backups for that scenarios would be fine, and we could even revert to Weekly/Inc backups.
So to summarise: include *all* your recovery scenarios in your DR testing, and don't forget accidental data loss is a recovery scenario too, not just total system or site loss.
On that "data loss" thing. How is it nobody noticed data had gone missing? Good question.
The automation process had deleted all computers in the database the night before. The computers had checked in again overnight and resubmitted DDRs, so new records were populated for them.
Nobody noticed until later that day, when collections with direct memberships were all empty but the query-based collections were mostly OK. At that point I noticed there was no inventory or history data on any computer, and they all had recent create dates.
So automation is great, until it isn't
So anyway, things are rarely as simple as you might like them to be. Technical simplicity can't always overcome political and bureaucratic obstacles, and there are very few "bad" solutions to things. There may be "good" and "better" ways, but ultimately use what works.
And a final note to finish on: just remember, it doesn't matter who you are or why you've done something a certain way, any question you ask will usually get a "but why didn't you do it xyz way instead" 😀 (Note: this isn't a bad thing, coz maybe you never knew that was an option.)