Failovers can’t serve two masters

As I’m sure you’re aware, the career path for a SQL Server DBA is a wonderful journey of growing your catalog of knowledge through unexpected lessons. For example: Did you find out the hard way you need to test restoring those backups? Did you discover you need to enable the dedicated admin connection after a server locked up? Did you learn the unexpected results of using reserved words as object names?

Don’t worry, you’re not alone.

Speaking of reserved words, I recently learned there’s a bit of a situation regarding a certain word and logical file names. Let me tell you a story.

One of the instances our team manages is on a two-node failover cluster instance. This particular instance has dozens of databases from different branches of the business, including some from recent acquisitions. We restore the databases from the previous owner’s server, check some settings and permissions, and then treat them like would any other database. You know – back them up, give them regular maintenance, sing them Frank Sinatra songs when they feel sad.

Well, on this instance we recently had to apply a newer service pack. So we upgraded the version on the passive node, clicked to fail over to that node, and…the failover failed. Over and over it failed at failing over. Total failure.

So we uninstalled the new service pack on the passive node, then reapplied it. Still won’t failover. We uninstalled all SQL Server components from the passive node then reinstalled them, including the new service pack. Still won’t fail over. Then we tried singing “All Or Nothing At All.” Still won’t failover.

Then we tried lots of swearing. Then we threw small objects. Then we called Microsoft.

Now, if you’ve worked with their support team you probably know they can…take…some…time. Running diagnostics, waiting for the support person to get back from holiday, waiting for them to evaluate the results, being escalated to someone else, running more diagnostics, waiting for that new person to return from holiday, having them evaluate results, being told the old results expired, etc.

Although it took more than a month’s worth of weeks, the good folks at colored squares eventually found an answer to our failure to failover: we had a user database with a logical filename of “master”.

And I was like…say whaaaaat???

Here’s how it was explained: when you upgrade a node of a cluster with a service pack, cumulative update, GDR, whatever, to a different version than the currently active node, SQL Server has to run some scripts in the “master” database to upgrade or downgrade the instance as it moves to the new node during a failover. And here is the actual SQL used to find “master”:

   DECLARE @dataDirName NVARCHAR(520)

   SELECT @dataDirName = SUBSTRING(physical_name, 1, CHARINDEX(N'master.mdf', LOWER(physical_name)) - 1)

       FROM master.sys.master_files

       WHERE (name = N'master')

Why yes, that script is looking at logical filename in sys.master_files. Not the database id, which should always be 1. Not the logical file name AND the database id, just to be sure. Just file logical file name. Huh.

Granted, I don’t know how someone actually comes up with a logical file name of “master” in a user database. Do they restore a copy of the master to create a new user database? Do they get funky with the “model”? Is this just some sort of joke that went all wrong? So many questions as to how we ended up here.

At any rate, we found the offending database and modified the logical file name. Then we attempted a failover, and…success! We then applied the service pack to the other node, successfully tested a second failover, and then went out for a round of adult beverages.

And there was much rejoicing.

Consider this “master” name quirk yet another thing you need to be file away in your catalog of database knowledge. Right next to singing “High Hopes” to overworked instances.

4 thoughts on “Failovers can’t serve two masters

  1. Now that’s a nice prank to play on in case your company ever gets bought out. Rename a few file names right before you leave.

  2. We set the Logical Name on a Restore based on the DB name. Then we can restore “MyDB” to “RESTORED_MyDB_20181211” and be sure it has different logical names 🙂

  3. I bet they’d have diagnosed this quicker if they’d followed their own advice and used SET instead of SELECT 😀

    Also…NVARCHAR(520)? physical_name is NVARCHAR(260). And I can see the install failing over if any of the directory names in physical_name contain the string “master.mdf” (incredibly unlikely, but I’m surprised at how ambiguous and assumptive this process is — I’d really expect the installer to be doing an actual path split rather than mangling strings with SQL)

Leave a Reply

Your email address will not be published. Required fields are marked *