I struck an interesting problem this week when migrating a database. Apparently the database was my users default, which immediately kicked me out of the server and prevented me logging back in. This is the story of how it happened and the steps I needed to take in order to resolve it.
It wasn’t a dark and stormy night, but considering what was about to go down it probably should have been. Typical weather really.
I was migrating the final database off and old SQL 2012 server. It was the final database on this server which we began migrating databases off over 18 months ago, so I was pleased to see it go. The database itself was tiny, so my part of the migration was just to perform a backup restore migration to the new server. The full backup ran fine in a couple of seconds, but when I ran the tail-log backup it completed and I immediately got disconnected.
I thought that was odd, but proceeded with the restore on the new server which went fine. We were using a CNAME redirect to catch any connection strings we didn’t know about, so the old server was renamed as part of the process. I asked the server engineer to hold off his rename and CNAME steps while I investigated the disconnect.
I quickly found I couldn’t log in with either my own account or the service account our application connects to SQL. Both received the same error.
Cannot open user default database. Login Failed. Login failed for user ‘xx’.
So, we have provided support for this client for a while, but the server predates our companies support contract. However what it looked like to me was the group containing our logins had it’s default database set to the database I’d just taken offline. Annoying, but not insurmountable. I asked one of the client testers I knew had security permissions on the server to log in and flip our groups default database back to master. But she got the same error.
Things were suddenly looking dark. I quickly completed the migration steps on the target server and we did the server rename and CNAME switch to the testers could test the successful migration, but what should have been me sitting quietly for an hour while they completed testing quickly became a mad flurry of trying everything to get access into the old server should we require a rollback. Now I wasn’t too concerned, we did have a snapshot taken prior to the change that was our sledgehammer rollback, but I didn’t know what was going on and wanted to understand.
There was a simple fix I should have done at this point, which is just to change the database to master in my connection string. When I did that it failed. I don’t know why, and haven’t been able to recreate that failure, which leads me to think that when I changed the database it didn’t ‘stick’ and I only tried it once, assumed it failed and moved on to find another solution. A lot of time should have been saved here by doing that correctly. For the record that is on the options->connection properties screen in SSMS. All you need to do is type in ‘master’ as per below:
That should allow you to log in and update the default database of users who can’t log in. It also shows why trying the same thing twice is sometimes NOT the definition of insanity, because you may have done it wrong the first time, particularly if you are restoring databases in the other window.
At any rate – assuming the direct connect with the correct default database was still failing we tried and failed the following steps:
- Access the server via “sa” – only to find the “sa” password was not recorded in the client password vault.
- Access the server via various other Windows accounts – All failed with the same default database error.
- Access the server via other SQL accounts – These succeeded but none had the credentials to add much to the troubleshooting or remediation.
- Force admin access via single user restart – This also failed with the default database issue.
- Force sysadmin access via DBATools. This is a really cool feature of DBATools, but in this case failed with an error that didn’t provide much context on the problem(I suspect it was the same default database login failure causing the issue).
So what we knew at this point was that all windows logins failed, while all SQL logins succeeded. I was beginning to suspect I knew the nature of what had happened. Some account that windows logins were a member of had been added with the default database set to the one I had just dropped. But how to prove it? And how to fix it?
I took the following steps:
- Create a local windows user.
- Add that user to the servers local administrator group
- Restart SQL in single user mode with access limited to SQLCMD connections
- Run my SQLCMD script to create a SQL login in the context of the new local windows account
- Let out a huge sign of relief when that worked.
- Restart the server in multi user
- Log in with my shiny new SQL sysadmin account and start the troubleshooting.
Once I had sysadmin access back troubleshooting took only a few seconds. Sure enough the DOMAIN\USERS account had been added to the server. It had no specific rights on anything but had the default database set to the database we’d just taken offline. I flicked that across to “master” and everyone was immediately able to log back in again.
All of this took around 1/2 hour to work through, and ultimately didn’t have any impact on the migration except for a couple of minutes delay before I handed the database over for the rename. I was able to do my troubleshooting in parallel with the application testing, so ‘no harm, no foul’ right? Well, yes. But it could have been so much worse if this wasn’t the last database we shifted off the server. This was an incredibly dangerous setting. If we had moved this database first instead of last the simple act of taking it offline would have taken every other application on the server offline.
Hopefully, if you find yourself in the same situation the information in this post will save you some time both diagnosing and resolving the problem. I’ve definitely added checks for Domain\Users accounts into my pre-migration checks, and while I don’t ever expect to see this scenario again, if I do it’s nice to know the fix is 10 seconds and 2 clicks away on the login screen.
- Be aware of the Domain Users group when making changes in a production environment.
- When troubleshooting – keep notes of all the bits you tried – because you may make a solution way more complicated than it actually needs to be