Category Archives: SQL Server

Are You High? Can You Recover?

The below is an extract from my presentation at SQL Saturday 252 on high availability and disaster recovery.  This post deals with the concepts of RTO and RPO.

Let’s start by covering off just what we are talking about when we are using these terms – Higher Availability and Disaster Recovery.  Basically what we are referring to is a couple of acronyms – RTO and RPO.  That’s Recovery Time Objective and Recovery Point Objective.  Or put another way, how long can a service be down, and how much data loss is acceptable following an outage.

Now, in a lot of places these aren’t defined.  They are replaced by things like “best efforts” and the expectations are that you’ll get things up as soon as you can and lose as little as you can.  This is a dangerous mentality, because it should be the expectation anyway.  If you have a recovery time objective of an hour, and you get things fixed after 15 minutes, you aren’t going to look at your watch, see you’ve got another 45 minutes before your RTO runs out so go get a cup of coffee before you throw the switch.  You’re gonna turn stuff back on and then get the coffee.  Ditto for the RPO.  If you can retrieve all the data, you will do that, not delete or not restore up to the defined RPO.     Of course you are going to do things as quick as you can and lose as little as you can.

When I say it’s dangerous I’m looking at the outage that drags out.  If things take an hour to bring back online, or if 15 minutes of transactions are lost….or if you don’t know how much data is lost…then you put yourself in a position where someone can come back and question if your “Best Effort” was good enough, or perhaps if someone else might have done things better.

So you need to get these things defined by the business, or the application owner etc.  If they can’t give you hard numbers, or if they just drag the chain on doing it, then you need to give the numbers to them.  Tell them…”Okay, right now if this system goes down we stand to lose 15 minutes data and it’s take me an hour to bring everything back up.”  If they accept that – fantastic, you now have a documented RTO and RPO.  If they don’t then you’ve started the conversation about what additional resource you need to hit a different target.

Another important consideration for the recovery time objective is when the clock starts.  If you have an RTO of 1 hour – is that one hour from the time the issue arises, one hour from the time an automated alert is sent.  1 hour from the time you see that alert, or one hour from the time you actually get onto the server to look at the problem or one hour from the time the decision is made that you need to invoke DR.  Obviously this matters.

The bottom line is that if you don’t have those two numbers defined then you cannot define whether your higher availability and disaster recovery strategy is a success or not.

A note on ‘uptime’

Now when we define successful uptime it’s often defined as a percentage or number of 9’s.  Let’s take a quick look at that.  Let’s say you go to your boss and say we had 95% uptime.  That’s a nice number right?  On a test you’d be happy with a 95?  But in uptime terms that translates to 72 minutes downtime every day, or over 18 days a year of downtime.  So the more 9’s we add the narrower those allowances become.  At 99.999% downtime we are talking about less than a second per day and just over 5 minutes per year.  So basically you are talking one unscheduled restart a year.

What Disaster?

The final thing I want to talk about before we move on is the different types of disaster.  You need to think of HA and DR basically like insurance policies.  I live in Christchurch, which got rather smashed up in a recent earthquake, and once upon a time insurance policies were simple here.  You insured your house for a lump sum, and so long as you kept up your regular maintenance(payments) then you knew that should disaster strike your insurance company would come along and you’d have your new house up and running within a sensible time(Currently insurance companies are defining ‘reasonable time as around 5-10 years).  This is an example of responding to setting an RTO RPO with “Just do your best”.  Who’s to say 5 years isn’t reasonable if it’s not defined?  Nowadays you have to provide all types of addition information on that insurance policy – geotech reports and the like, because the Insurance companies have discovered that to replace a $400k house can often cost more that $400k.

In the SQL world it’s the same.  We need to consider more than the SQL server when we are talking about these things, because SQL Servers are not independent of everything else in your infrastructure and they are affected by different types of ‘disaster’.  You can have your server go down.  You can have a patch fail.  You can have your service account password changed and prevent SQL from starting.  You can have network or active directory problems.  Or you can have a data issue, where something that shouldn’t have been deleted is, where a “WHERE” clause is missed when executing a query.

And don’t forget this(We are in Christchurch so we are thinking of big natural disasters, and we are in IT so we know that sometimes hardware and software don’t do what you expect):  But not all disasters are accidents.  There are people out there who can and will deliberately do damage to your system – and anyone who came along to the SQL User Group a few months back will be able to see just how easily that can potentially happen.

So as we move through the rest of this presentation remember…..there are different sorts of disaster, and we need to consider those in our HA and DR strategy.

How to Use TSQL to Get Database Backup History

Having a record of the database backup history is particularly useful.  I use the following script a lot, but I find it’s main use is when I am logshipping and SOMEBODY breaks the recovery chain by taking an extra log backup.  This can give you an idea of who did it and where the log backup went.

SELECT TOP 50 bs.database_name, m.physical_device_name,
bs.user_name, bs.backup_start_date,
CASE bs.[type]
WHEN 'D' THEN 'Full'
WHEN 'I' THEN 'Differential'
WHEN 'L' THEN 'Transaction Log'
END
AS BackupType
FROM msdb.dbo.backupset bs
INNER JOIN msdb.dbo.backupmediafamily m ON bs.media_set_id = m.media_set_id
WHERE bs.database_name = 'AdventureWorks'
ORDER BY backup_start_date DESC, backup_finish_date
GO

If you are just after the last backup date of each database take a look at this one:

select T1.name as [DatabaseName],
MAX(T2.backup_finish_date) as [Last Backup Date]
from sys.databases T1 LEFT OUTER JOIN msdb.dbo.backupset T2
on t2.database_name=t1.name
where t1.name not in ('model','tempdb')
and t2.type in ('D', 'I', 'G','F') — Data, incremental, file/filegroup or file/filegroup incremental
group by T1.name
order by t1.name

How to Use TSQL to Monitor Index Fragmentation

The script I use to monitor fragmentation is a very slightly modified version of the one found here which gives me a little additional information and excludes a bunch of indexes I probably don’t care about:

SELECT OBJECT_NAME(ind.OBJECT_ID) AS TableName,
ind.name AS IndexName,indexstats.index_type_desc AS IndexType,indexstats.page_count AS [Page Count],
indexstats.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(),NULL,NULL,NULL,NULL) indexstats
INNER JOIN sys.indexes ind
ON ind.object_id=indexstats.object_id
AND ind.index_id=indexstats.index_id
WHERE indexstats.avg_fragmentation_in_percent> 10
and indexstats.page_count>500
ORDER BY indexstats.avg_fragmentation_in_percent DESC

According to Paul Randall’s blog post on Where the Books Online Index Fragmentation Thresholds Come From, the generally accepted guidelines are to rebuild indexes larger than 1000 pages that have more than 30% fragmentation.  So in the query above I have trimmed out indexes with less than 500 pages and included page count in the output.  You can go to 1000 pages if you want, but I like to start with a more aerial view when I have to troubleshoot index fragmentation.  It’s also worth reorganising Indexes which are fragmented below 30% – according to the above article anywhere between 5 and 30% should be reorganised.

Paul also makes a really important point in the article quoted above, and that is that these numbers are made up to give us a starting point.  So use the script above freely, but don’t forget there are a lot of other factors involved in whether the indexes it returns are causing you problems, or your constant attempts to get them sitting nicely are actually doing nothing to improve your server performance and are just sucking up your time and effort.