The below is an extract from my presentation at SQL Saturday 252 on high availability and disaster recovery. This post talks about SQL Server Failover Clustering.
This post is going to cover the basic consepts of Windows Clustering. So, while logshipping is definitely a disaster recovery strategy, clustering is all about availability. With clustering you have two (or more) windows servers and they form a windows failover cluster. On top of that you are going to install a SQL Server failover cluster. When you do this it needs to be specifically installed that way. So with the other technologies we are talking about you always have the option of saying later – let’s add mirroring, or let’s add some logshipping to that. You don’t have the ability to do that with clustering – you need to plan it from day one. If you have an existing SQL server and you want to cluster it then you are basically looking at installing the cluster and doing a server migration.
Clusters are much easier to setup now than in previous versions of SQL, but there’s still quite a bit involved. You start with two(or more) windows servers and you form a Windows Failover Cluster with them. Then you assign some shared disk which is accessible to all the nodes of the windows cluster. After that you install a SQL clustered instance and that becomes the owner of the disk resource you added. When you do this you have to configure a few things at the active directory level – a cluster name, some IP addresses and a handful of permissions. You can then add as many other sql nodes as your SQL edition allows. Now from SQL 2005 you can have a 2 node active passive cluster with Standard Edition. In 2000, and in SQL 7 you needed Enterprise. There’s still a good advantage to using enterprise cluster, because you can have many more nodes in your cluster.
With a cluster there is a single set of data which is stored on the shared disk. If one node fails, or if you manually fail over to another node then the disk is detached from the current node, attached to the new node and the SQL Service is automatically started. This has a whole bunch of advantages. First of all, you have a virtual instance name. You point the application at that and then it doesn’t matter which actual windows server owns the disks and the sql service – active directory and the windows failover cluster sort that out and direct your application to the right place. It also means that if you need to schedule downtime on an instance – for patching as an example – you just move the service to a new node.
This is increasingly important as people look to consolidate their databases on bigger and bigger SQL Servers. When we are talking 64 bit architecture, multi-core processors and high end SAN back ends the limitations around hardware are far less than what they have been previously, so you can have a single SQL Instance supporting multiple high usage applications. So it’s more and more common to see these big general usage SQL Servers. But you know what they say about great power…….the server responsible for this increased load is increasingly important. It becomes more difficult to schedule downtime because you need to co-ordinate that downtime with multiple different application owners or business units.
I have a little professional disagreement with a colleague over this. His view is that if you have to patch 20 servers supporting one application each and need 1 hour of outage time to do that patch then you need to arrange 20 hours of outage. If you have them all on the same server there’s still 20 applications to arrange an outage with so you still have to arrange a 1 hour outage with 20 different applications. That’s true – but it’s also not the point. The point is that those 20 outages need to be AT THE SAME TIME. And that’s where it gets tricky, because in order to load balance your business use, and your data loads and your backups and….all the other scheduled stuff that happens to your server, you will have spread the load evening across the hours in the day. I hope. So there’s always likely to be something going on.
At any rate, adding a cluster gives you 2 big benefits here. Firstly there’s the obvious failover capacity if a disaster happens on a single server, but additionally your outage window now need only be the failover times, and believe me it’s easier to sell a 1 minute ‘micro-outage’ than a 1 or 2 hour outage.