Author Archives: Rob

About Rob

http://gumption.co.nz/about-me/

Backup Questions to ask: Part Two – Retention

In my previous post on backups I talked about division of responsibility and documenting your process. In this post I wanted to talk about backup retention, how to decide what that retention should be, and how to get the most millage out of your local disks.

I’m going to be talking about a simple backup process involving the DBA taking care of native SQL backups to disk, and the system administrator taking those flat file backups and moving them to long term storage. I defined it in my previous post like this:

I try and simplify backups to something approximating the below:

1. The DBA takes the native SQL Server backups to a local disk. These backups are retained for X days.

2.The system administrator is responsible for taking a copy of the native SQL backups from the local drive and copying them to long term storage, where they are retained for X2 days\month\years.

Once you have agreed the above process it should be a relatively simple task of working out what is X and what is X2 and then putting it in place. You’ll notice the most technical I have got to this point is referring to ‘native SQL backups’ and ‘long term storage’. I haven’t talked about full backups, differentials or log backups. I haven’t spoken about high availability or disaster recovery. The reason for this is right now this is not a technical discussion, it’s still a business one. It’s really easy for us as DBA’s to jump in and put in place a standardized recovery process, and in most cases that’s good enough. But actually what we need to do first is understand what the business needs for retention.

X and X2 are going to define your Recovery Point Objective – the point in time you could recovery to. In other words how much data could you potentially lose if you restore to a certain point in time. This is important, but it’s also only part of the picture.

A consideration that is regularly overlooked is the Recovery Time Objective – or how long can an application be offline. This is critical and something you simply don’t know unless you perform semi-regular restore tests. The reason I say that is that if you have an RTO of 1 hour, and your full backup takes two hours to restore – you will never be able to meet that target with backups alone. You simply can’t make restores run faster after a certain point. If you have an RTO that is less than the time it takes to recovery your backups, then you need to provide that availability another way – and that’s where the discussion of high availability needs to begin. We’ll dig more into that in a future post.

Assuming that you can restore within the RTO in all your known scenario’s then the only question is developing a backup strategy to meet your RPO. The starting point for a lot of DBA’s is a nightly full backup and 5,10,15 or 60 minute transaction log backups. This meets the most common restore scenario I have seen(most restores are within the past 24 hours in my experience). But it also plays havoc with the amount of data you are moving to long term storage.

I’ve recently been experimenting with much longer recovery chains. A monthly full backup, weekly differentials and log backups every 15 minutes has proven to be a good starting point in a lot of the bigger databases we support. These databases are often 95% or more data that never gets updated, and the updates are quite small and not that heavy. In one case we moved from a 254GB nightly backup to this model and the total combined size of a months differential and log backups was 23GB. That allowed us a massive saving of disk space with much faster restores to any point in time in the last month.

My point is it’s not 2000 any more, the GUI allows us to create restore scripts so we are not relying on manually crafting restore processes with 500 log backups – SSMS spits the code out for us. It’s as much work for you to hit F5 whether it’s restoring 10 log backups or 200.

The other approach to this is breaking archived data into partitions or using file groups so that you are only ever taking backups of the data that is actually changing. SQL Servers backup command is incredibly flexible so get to know your data and design a plan that gives you the best bang for buck rather than persisting with a generic plan on all your databases.

AX Cannot Create Table in TempDB error – Resolved

We’ve been migrating a client to new AX servers and we hit an error that I’ve seen a few times now where AX attempts to create a table in the tempdb database and is denied permissions.

Object Server 01:  The database reported (session 14 (user)): [Microsoft][SQL Server Native Client 10.0][SQL Server]CREATE TABLE permission denied in database ‘tempdb’.

This error is due to AX’s method of using persistent ‘temp’ tables in the tempdb database(The same design decision that makes high availability for AX such a chore), and is not difficult to resolve.

To understand the problem you need to know a little about the way TempDB works.  It’s designed as a container for temporary objects and overflow operations, and to ensure it is tidied up it is recreated each time a SQL Server is started.  This means that as well as all tables in the database being removed, that all user permissions granted on the table are also wiped.  AX uses this database in a slightly non-standard way, creating a group of tables on connection and persisting them until reboot.  If these tables are removed(for instance by a FCI or AOAG failover) a number of errors of the ‘table not found’ variety will occur resulting in an AOS reboot being required.  Additionally, when SQL Server restarts(or fails over) the rights on tempdb need to be reapplied so when the AOS reconnects it can recreate all it’s temporary tables.  So having that background:

Solution Number One – Stored Procedure in Master Database

First identify the account which requires the permissions to tempdb in the first place.  This should be the service account running the AOS.  Then create a stored procedure in the master database such as the below:

CREATE procedure [dbo].[CREATETEMPDBPERMISSIONS_tempdb] as

begin

exec (‘USE tempdb; declare @dbaccesscount int; exec sp_grantlogin ”mydomain\myAOSserviceaccount”; select @dbaccesscount = COUNT(*) from master..syslogins where name = ”mydomain\myAOSserviceaccount”; if (@dbaccesscount <> 0) exec sp_grantdbaccess ”mydomain\myAOSserviceaccount”; ALTER USER [mydomain\myAOSserviceaccount] WITH DEFAULT_SCHEMA=dbo; exec sp_addrolemember ”db_ddladmin”, ”mydomain\myAOSserviceaccount”; exec sp_addrolemember ”db_datareader”, ”mydomain\myAOSserviceaccount”; exec sp_addrolemember ”db_datawriter”, ”mydomain\myAOSserviceaccount”;’)

end

EXEC sp_procoption N'[dbo].[CREATETEMPDBPERMISSIONS_tempdb]’, ‘startup’, ‘1’

This will fire whenever SQL Server starts and create the user and assign the permissions required.  Obviously replace mydomain\myAOSserviceaccount with the account used to run your AOS.

Solution Two – The Sledgehammer

There’s various reasons where the above won’t be ideal.  Multiple AOS service accounts, policy disallowing user stored procedures in system databases etc.  You can achieve the same result by granting the login(s) that require access to the model database.  This should then pass through to the tempdb database when the server is restarted.

Solution Three – The Non-DBA approach

I’m a database guy and not an AX guy, so I tend to look for database resolutions to problems.  I believe – but have not had confirmed, that a full database sync will also resolve this issue.  I haven’t seen this sorted from the application side but have read it in a few places now.  And if it’s on the internet it must be true.

 

Backup Questions to ask – Part One: Division of Responsibility

I’ve written a bit about various backup products for SQL, but I wanted to extend on that a little bit to focus on some of the specifics. In this post we will be talking about the division in responsibility for backups and restores between the DBA and the system administrators.

These two people are often the same, but in this post I will treat them as if they are two different people. At a minimum it is two different functions of their job, and I’m a believer that if you have to wear many hats you yourself should act as though you have two jobs. That way when your boss comes along and offers you another employee to help you out, you can just hand over one of your hats like it’s a pre-packaged job.

Despite the improvement of specialist products for taking SQL backups, I still have a strong preference that native SQL Server backups are used. If pushed I will say – the best backup solution is probably the one which the person doing the restore is most familiar with, which is a slightly different proposition, but I’m going to assume the person responsible for the backups should be the DBA, because they are the one most likely to be in the firing line if the database can’t be restored.

I try and simplify backups to something approximating the below:

  1. The DBA takes native SQL Server backups to a local disk. These backups are retained for X days.
  2. The system administrator is responsible for taking a copy of the native SQL backups from the local drive and copying them to long term storage, where they are retained for X2 days\months\years.

It then follows that if a restore is needed to a database server within X, then the DBA can do that directly from disk. If a restore point is between X and X2, the system administrator can restore the backup from long term storage back to the local drive, where the DBA can proceed with the install.

Then all that is required is defining X and X2, and scheduling in occasional test restores to random dates to prove the whole system works.

Now most people reading this will (I hope) be waiting for something interesting in all this. It’s an incredibly simple setup, very clear who is responsible for what and if a backup can ever not be restored – it should be very clear as to who has dropped the ball.

But here’s the thing, in my role I get to see an insight into multiple different SQL environments, and it seems even this very simple process can get horribly tangled up resulting in a whole lot of confusion, finger pointing, and ultimately data loss. The situation is complicated enough when one person is responsible for both jobs. WHen it’s two people it becomes a little murkier, and when it’s multiple people across multiple business units then it quickly sinks to a level where it’s really easy to find you don’t have a backup right at the most critical point – when you need a restore. Add in 1 or more external vendors and things can quickly spiral out of control.

Do you know your restore processes? If I was to ask for a restore of a given database to a given date could you tell me?

  1. Who I should actually be asking?
  2. What is the longest ago I can restore to?
  3. What points in time can I restore to? Real time? Nightly backups? Monthly archives?
  4. Who needs to authorize a production restore?
  5. Who needs to perform a production restore?

If all of those questions are not immediately clear, then I’m going to have to raise the D word – Documentation. Someone needs to take responsibility for documenting the backup and restore processes. They need to interview all the people we think are involved in the process and get a really clear understanding of what they are doing, what they think other people are doing and then where there are any mismatches.

As a job – it sucks, but it can be quite enlightening. And it’s better doing a job that sucks that averts a disaster, than not doing the job that sucks, and finding out it sucks not to have a job.

Always On Availability Group Secondary Becomes Unreadable

An interesting issue occurred in a client environment today.  The configuration was an Always On Availability Group which had a primary replica(Also a SQL cluster), a reporting replica(non-voting) and a DR replica in a different data center.  There was a cluster issue the previous night which saw the report server removed from the cluster.  The cluster and the availability group both recovered within 60 seconds.  However one of the databases in the AOAG could not be queried from.  The Availability group confirmed the database was healthy and data was transferring to it fine, but any select query just ran forever.  I ran through a list of checks, scanned the error logs and then went through a number of fruitless steps to get things running again:

  • Consistency check on the database was clean.
  • Database exhibited the same behaviour when removed from and then returned to the availability group.
  • Database exhibited the same behaviour when the replica was removed from and added back into the availability group(Not that I expected any different after the previous test, but if we didn’t try things we thought wouldn’t work we’d never solve half our IT problems).
  • I even transferred the DR server to readable state and confirmed the same behaviour existed there.

That all amounted to quite a bit of wasted time trying to find a quick fix, rather than really understanding the issue.  So I checked exactly what the select thought it was waiting on:

HADR_DATABASE_WAIT_FOR_TRANSITION_TO_VERSIONING

The wait type HADR_DATABASE_WAIT_FOR_TRANSITION_TO_VERSIONING is not all that well documented but I got enough information from this post and especially this one too to see where the issue lay.

There was an open transaction on the production database.  Until that was either committed or rolled back the secondary replica cannot use snapshot isolation transaction level, and therefore will not return a response(either error or result set) from the SELECT statement. We tried to stop the transaction from the application side, but ultimately ended up issuing a KILL command from within SQL.

 

It’s a Corrupt World

Luckily, in this day and age we don’t see so much database corruption, but it still sends a chill down the spine every time it pops up.  Having dealt with a number of cases in a number of different places now it’s not the corruption that bothers me so much as the pure amount of time it takes to fix.  Any unscheduled application outage is going to make you the focus of some very close scrutiny, and there’s going to be a lot of pressure to get things resolved as quickly as you can.    I’m going to assume you have detected the corruption with DBCC CHECKDB and have a minimum repair of repair_allow_data_loss.  Here’s some tips to get the application online as quickly as possible.

1.  Don’t try and get the application online as quickly as possible.  That way leads to mistakes and can compound the error.  Your goal should not just be to make this inconvenient problem go away, but to identify its cause, scope and resolve it in the best way possible.  Having identified that corruption exists your first move should be to make sure everyone who uses the application is aware there’s a problem.  At this point you need to assume that you are going to be dealing with dataloss.  If you don’t talk to your users everything that is going into the database can potentially be a further headache for you later.

2.  Start the process of retrieving or securing backups.  Hopefully you are running regular DBCC checks.  It’s worth starting the process of getting all backups between the last known good check and the time the corruption occurred.  Also, move those backups somewhere secure.  There’s nothing worse than seeing your (possibly) good backup suddenly get overwritten with a (definitely) bad one.

3.  Isolate the corruption.  You need to know where it is.  Once you know, the first step is NOT to fix it.  Your first step is to relay that information back to the business(or users) and let them know where the problem lies.  At this point get the application support guys in and confirm the impact of data being lost in that particular table.  You can save yourself hours with an informed application support person.  In the instance of corruption I was dealing with today it looked like a financial transaction table and the first 5 rows had 2 $70,000 transactions.  Not a table I wanted to lose data from.  The application guy was able to advise it was a table used only for logging which was usually cleaned up by an archiving process that just hadn’t been turned on for this server.  Knowing a table is essentially unused makes the decision process for the business very much easier.  (Unfortunately this particular application support person undid all his good work by following up with advice to the client to turn off checksum on all databases, a piece of advise akin to saying we can reduce the number of people convicted of murder by making murder legal).

4.  Now you have defined the problem, and the required fix, go ahead and fix it.

I know I made number 4 the shortest step, and it’s often very complicated.  But this is not a step by step technical guide to fix corruption in your database.  This post is about how to prep those around you to deal with the impacts of database corruption when it is identified, and after it is fixed.  In some cases the decision of how to proceed will fall directly to you, in others your recommendations will be considered and followed, and in others they will be discarded completely.  That’s the nature of working in a company, or as a contractor where other people have a vested interest in the database you are managing.  Leaving aside the technical aspects of the fix, it’s critical you manage people and expectation well.

Corruption is a big scary thing, but the biggest scariest bit is that when it strikes people have to make hard decisions, often with other people yelling at them or angrily wondering why their application is offline.  As a DBA, corruption is one of the scariest things you will face.  It’s not your fault(well, usually) but now you are the only one who can get the application back up and running.  This is your reason for being employed, and it’s natural to feel a lot of pressure to get things done as quickly as possible.  The best thing you can do…the very best thing…is not to try to do it as quickly as possible, but to do it as well as possible.  When this problem is resolved, be that in 30 minutes or after pulling an all nighter, there shouldn’t be anyone who is surprised by the outcome.  Just because the repair option is called REPAIR_ALLOW_DATALOSS doesn’t mean everyone will understand that data will actually be lost.  People are weird like that.  It’s your job to make sure that all interested parties know the implications of any fix you put in place BEFORE you put it in place.  It’s your job to present the problem, and then clearly explain the pro’s and cons of the possible solutions.

 

 

TSQL to get last good DBCC CHECKDB

This is a piece of code I use to determine the last known good DBCC CHECKDB being run against a database.  I’m surprised this information is so tricky to find.  I’d expect it to be sitting on the database properties tab right under last known full backup.  But it’s not.  Instead it is listed as a database Info property, and we need to jump through some hoops to find it.  To save anyone else jumping through the same hoops I’ve put my script below that gives the last known DBCC CHECKDB date for all databases on a SQL Server instance.

CREATE TABLE #DBCCs
(
ID INT IDENTITY(1, 1)
PRIMARY KEY ,
ParentObject VARCHAR(255) ,
Object VARCHAR(255) ,
Field VARCHAR(255) ,
Value VARCHAR(255) ,
DbName NVARCHAR(128) NULL
)

EXEC sp_MSforeachdb N’USE [?];
INSERT #DBCCs
(ParentObject,
Object,
Field,
Value)
EXEC (”DBCC DBInfo() With TableResults, NO_INFOMSGS”);
UPDATE #DBCCs SET DbName = N”?” WHERE DbName IS NULL;’;

SELECT DbName,Value as [Last DBCC] from #DBCCs
WHERE Field = ‘dbi_dbccLastKnownGood’
–AND Value < GETDATE()-14  –Uncomment to return only databases that haven’t had a consistency check in the past 2 weeks.

Drop TABLE #DBCCs

 

Now, I called it ‘my code’ but in reality, like all my code, it’s based on something shamelessly stolen from smarter peoples minds and shaped to my own purpose.  In this case the core of that little snippet is in Brent Ozars Blitz Script.  If all you want is the last date of a consistency check then by all means help yourself to the above code.  But if you want to quickly identify a whole bunch of issues on a server you don’t know well then I’d suggest you take a look at what the full blitz script can do.

Optober – SQL sp_configure Options – Priority Boost

Optober rocks on today with a quick post on Priority Boost.  I hadn’t planned to deal with this one until much later in the piece, but I’ve been reviewing a very unstable server for a client today, and it tracks back to this option.  It’s another one of those things like ‘autoshrink’ which sound like a perfectly reasonable idea until you understand what it actuially does.

warning

Danger – Only enable this setting if you like running with scissors.

Option Name:  Priority Boost

Usage:  sp_configure ‘priority boost’, 1

Default: 0

Maximum: 1

Requires Restart:  Yes

What it does:  Just like the name says, this option boosts priority of the SQL Server processes.  Which sounds like a good idea, but what does it actually mean?  Behind the scenes here is what is happening – SQL Server threads are being made uber important.  They get precedence over just about everything else.  Now this still sounds like a good thing right – this box is dedicated to SQL, so it should get precedence over everything.  Only unfortunately in this case everything includes the operating system.  The stability of the SQL Server is dependent on the stability of the operating system.  I spent 4 years in Christchurch when it was shaking like a frozen puppy, so I can tell you that when the foundations rock, buildings fall over.  The only threads that are a higher priority than SQL Server with Priority Boost enabled are the REALTIME_PRIORITY_CLASS level threads, and the usage of those comes with all types of warnings.  You can read more about Windows Operating System Priority Scheduling to see all the big flashy warnings about doing this in an application.  SQL Server is just another application and all the standard warnings apply.

If you do go ahead and enable it, at the very least, you can expect your interactions with the server to be unpredictable.  You could be moving your mouse or clicking the start button, but SQL has priority over that.  You could be opening management studio – but SQL has priority over that.  You could be trying to stop your SQL instance because it is running horribly, but guess what…anything running in SQL has priority over that too.

It gets even worse if you are using a cluster or availability groups.  Giving priority to SQL can mean that your cluster heartbeat is interrupted and your instance initiates a failover.  I’m sure you don’t want your HA continually initiating failovers.

When should I use it:  Alright…so if you got this far, you haven’t really absorbed the above.  This setting should never be used unless explictely instructed by a Microsoft Engineer to resolve a specific issue on a server.  Even then I suspect I would argue against with them before throwing the switch.  Take a gun and fire buckshot into your server from 100 paces – it’s probably less damaging than having this setting enabled for any extended period.

What else should you know:  There is some light.  This configuration is on the chopping block and will be removed in a future version of SQL Server.  But every silver lining has a cloud….I’ve been reading that on the books online page for Priority Boost for as long as I can remember.