Monthly Archives: January 2015

Always On Availability Group Secondary Becomes Unreadable

An interesting issue occurred in a client environment today.  The configuration was an Always On Availability Group which had a primary replica(Also a SQL cluster), a reporting replica(non-voting) and a DR replica in a different data center.  There was a cluster issue the previous night which saw the report server removed from the cluster.  The cluster and the availability group both recovered within 60 seconds.  However one of the databases in the AOAG could not be queried from.  The Availability group confirmed the database was healthy and data was transferring to it fine, but any select query just ran forever.  I ran through a list of checks, scanned the error logs and then went through a number of fruitless steps to get things running again:

  • Consistency check on the database was clean.
  • Database exhibited the same behaviour when removed from and then returned to the availability group.
  • Database exhibited the same behaviour when the replica was removed from and added back into the availability group(Not that I expected any different after the previous test, but if we didn’t try things we thought wouldn’t work we’d never solve half our IT problems).
  • I even transferred the DR server to readable state and confirmed the same behaviour existed there.

That all amounted to quite a bit of wasted time trying to find a quick fix, rather than really understanding the issue.  So I checked exactly what the select thought it was waiting on:

HADR_DATABASE_WAIT_FOR_TRANSITION_TO_VERSIONING

The wait type HADR_DATABASE_WAIT_FOR_TRANSITION_TO_VERSIONING is not all that well documented but I got enough information from this post and especially this one too to see where the issue lay.

There was an open transaction on the production database.  Until that was either committed or rolled back the secondary replica cannot use snapshot isolation transaction level, and therefore will not return a response(either error or result set) from the SELECT statement. We tried to stop the transaction from the application side, but ultimately ended up issuing a KILL command from within SQL.

 

It’s a Corrupt World

Luckily, in this day and age we don’t see so much database corruption, but it still sends a chill down the spine every time it pops up.  Having dealt with a number of cases in a number of different places now it’s not the corruption that bothers me so much as the pure amount of time it takes to fix.  Any unscheduled application outage is going to make you the focus of some very close scrutiny, and there’s going to be a lot of pressure to get things resolved as quickly as you can.    I’m going to assume you have detected the corruption with DBCC CHECKDB and have a minimum repair of repair_allow_data_loss.  Here’s some tips to get the application online as quickly as possible.

1.  Don’t try and get the application online as quickly as possible.  That way leads to mistakes and can compound the error.  Your goal should not just be to make this inconvenient problem go away, but to identify its cause, scope and resolve it in the best way possible.  Having identified that corruption exists your first move should be to make sure everyone who uses the application is aware there’s a problem.  At this point you need to assume that you are going to be dealing with dataloss.  If you don’t talk to your users everything that is going into the database can potentially be a further headache for you later.

2.  Start the process of retrieving or securing backups.  Hopefully you are running regular DBCC checks.  It’s worth starting the process of getting all backups between the last known good check and the time the corruption occurred.  Also, move those backups somewhere secure.  There’s nothing worse than seeing your (possibly) good backup suddenly get overwritten with a (definitely) bad one.

3.  Isolate the corruption.  You need to know where it is.  Once you know, the first step is NOT to fix it.  Your first step is to relay that information back to the business(or users) and let them know where the problem lies.  At this point get the application support guys in and confirm the impact of data being lost in that particular table.  You can save yourself hours with an informed application support person.  In the instance of corruption I was dealing with today it looked like a financial transaction table and the first 5 rows had 2 $70,000 transactions.  Not a table I wanted to lose data from.  The application guy was able to advise it was a table used only for logging which was usually cleaned up by an archiving process that just hadn’t been turned on for this server.  Knowing a table is essentially unused makes the decision process for the business very much easier.  (Unfortunately this particular application support person undid all his good work by following up with advice to the client to turn off checksum on all databases, a piece of advise akin to saying we can reduce the number of people convicted of murder by making murder legal).

4.  Now you have defined the problem, and the required fix, go ahead and fix it.

I know I made number 4 the shortest step, and it’s often very complicated.  But this is not a step by step technical guide to fix corruption in your database.  This post is about how to prep those around you to deal with the impacts of database corruption when it is identified, and after it is fixed.  In some cases the decision of how to proceed will fall directly to you, in others your recommendations will be considered and followed, and in others they will be discarded completely.  That’s the nature of working in a company, or as a contractor where other people have a vested interest in the database you are managing.  Leaving aside the technical aspects of the fix, it’s critical you manage people and expectation well.

Corruption is a big scary thing, but the biggest scariest bit is that when it strikes people have to make hard decisions, often with other people yelling at them or angrily wondering why their application is offline.  As a DBA, corruption is one of the scariest things you will face.  It’s not your fault(well, usually) but now you are the only one who can get the application back up and running.  This is your reason for being employed, and it’s natural to feel a lot of pressure to get things done as quickly as possible.  The best thing you can do…the very best thing…is not to try to do it as quickly as possible, but to do it as well as possible.  When this problem is resolved, be that in 30 minutes or after pulling an all nighter, there shouldn’t be anyone who is surprised by the outcome.  Just because the repair option is called REPAIR_ALLOW_DATALOSS doesn’t mean everyone will understand that data will actually be lost.  People are weird like that.  It’s your job to make sure that all interested parties know the implications of any fix you put in place BEFORE you put it in place.  It’s your job to present the problem, and then clearly explain the pro’s and cons of the possible solutions.