An interesting issue occurred in a client environment today. The configuration was an Always On Availability Group which had a primary replica(Also a SQL cluster), a reporting replica(non-voting) and a DR replica in a different data center. There was a cluster issue the previous night which saw the report server removed from the cluster. The cluster and the availability group both recovered within 60 seconds. However one of the databases in the AOAG could not be queried from. The Availability group confirmed the database was healthy and data was transferring to it fine, but any select query just ran forever. I ran through a list of checks, scanned the error logs and then went through a number of fruitless steps to get things running again:
- Consistency check on the database was clean.
- Database exhibited the same behaviour when removed from and then returned to the availability group.
- Database exhibited the same behaviour when the replica was removed from and added back into the availability group(Not that I expected any different after the previous test, but if we didn’t try things we thought wouldn’t work we’d never solve half our IT problems).
- I even transferred the DR server to readable state and confirmed the same behaviour existed there.
That all amounted to quite a bit of wasted time trying to find a quick fix, rather than really understanding the issue. So I checked exactly what the select thought it was waiting on:
There was an open transaction on the production database. Until that was either committed or rolled back the secondary replica cannot use snapshot isolation transaction level, and therefore will not return a response(either error or result set) from the SELECT statement. We tried to stop the transaction from the application side, but ultimately ended up issuing a KILL command from within SQL.