Tag Archives: #tsql2sday

Why “Turn It Off and Turn It On Again” Still Works – Except for databases!

For this months TSQL Tuesday, Joe Fleming asks us “What’s a strange or unique problem you encountered and resolved, and how did you identify the cause/resolve the issue. What was your thought process?”

I am taking a slightly different tangent. My problem is neither strange or unique – in fact it’s infuriatingly common and it stems from one of the most common troubleshooting techniques in IT. While asking users “Have you tried turning it off and on again?” is a common go to for tech support call handlers, it is not a great idea when the “it” you are talking about is a database server

My story has happened to me in several forms over the years, but I still remember the first time it did. I had to deal with a problem where users reported a seemingly random set of job failures. Sometimes ETL jobs completed, sometimes they didn’t, sometimes they errored, sometimes they didn’t. Most systems have fairly useful logging of what is happening and SQL Server is no different.

The agent jobs that people were talking about had histories that regularly just stopped. Some had weird last run statuses of cancelled, or in one case running when there was no process attached to it. I didn’t need to look too much further at this point, but the SQL Engine error logs confirmed what I suspected – the server was being automatically started every night, and jobs were trying to run over this period.

It’s always nice to start a relationship with a new client with an easy win like this. Within about 5 minutes I could tell them what was going on, could point at the process that was doing it and had a clear path to resolve their sporadic load failure issues. Then they asked me the question “Why was this put in place?”

That’s a very difficult question to answer without having any history of the server, the jobs it has run previously and the people who had worked on it. But there was a clear clue in this case. The restart script contained a comment:

#Restart to help memory problem

Oh boy. I can only theorize at this point, but it seems like the old kneejerk to “SQL has a memory leak” response. Left to it’s own devices, SQL Server will happily grab up to the maximum amount of memory available to it (Mostly defined by the Max Server Memory setting), it’s very common for non DBA’s to see SQL using a huge chunk of memory, even when it is doing nothing right at that moment, and figure they can fix the problem by rebooting. That’s not a great response. That memory cache is a feature of SQL and what allows it to serve queries faster, not some bug or memory leak.

So in this case we could move jobs around to avoid the restart window until we’d gathered enough data to prove that the restart itself was not needed, but reflecting on it does raise the question – “Why does turning it off and turning it on again work so often?”

Here’s a subset of the reasons I can think of that a restart can work:

Stuck processes are cleared. All those background processes that can conflict and cause blockages are cleared and start from scratch.
Memory is cleared. Memory management is really complicated when you get into it, and over time inefficiencies in what is in memory and whether it needs to be can build up.
Re-establishes network connections. Networks are also complicated, and restarts will often reclaim IP and register devices, resolving some of these conflicts, or at least re-registering devices.
Updates and software glitches. There may be unapplied updates waiting on a restart to take effect. I’ve seen multiple cases where the very functionality that is supposed to avoid interrupting your workflow by holding off a restart until “a time of your choosing” actually stops it dead by breaking network connectivity or some other dependency. Likewise software bugs are not always there when you turn software on. Hopefully if they were that would be picked up in regular testing pre-release. More often a bug is triggered by a certain usecase or set of parameters. This is cleared with a restart.

In short, rebooting is a quick way to return a device to a known, clean state. It’s a universal first step because it’s fast, simple, and often effective—even if you don’t know what caused the problem in the first place.

But databases are different. For a start they are usually multi user devices, so turning them off affects every system connected to them, and turning them on again can often be a slow process. Again, this is not a bug, this is built into SQL to maintain consistency. You WANT SQL to play through the logs and bring the database to a consistent state before allowing access to the system. Consistency is only one problem that a reboot can cause, there’s a bunch of others.

Multi-System integration – SQL itself is designed to return the database to a consistent state following a transaction, but a SQL database is not a standalone system, it is often one of many that form an intricate web of data movement. Oftentimes complex systems have a very specific order that things need to be shut down and started up in to bring back online cleanly.
Connection Persistence – Some applications maintain a persistent connection and expect that connection to be there. One not uncommon Microsoft Enterprise Resource Planning system has the application create objects in TempDB when it first connects to the database. If the database server restarts unexpectedly these object dependencies must be known and the creation scripted to ensure application functionality.
Application State Management – Some databases are state management engines for their application, meaning the current state is stored and the process will continue following the restart.
High Availability implications – whatever happens on the primary server is also duplicated to secondary servers. This can lead to extended recovery periods as transactions need to be replicated to remote secondaries.

And of course the fact is that many servers are consolidated, running multiple systems, so if you restart to solve your problem, you are affecting multiple other systems at the same time. A restart is an outage on every system affected by it and may be breaching an uptime SLA on someone else’s system you aren’t even aware of.

Finally, for a user system, or even an application server, you can get away with the approach that says you just need things to work again as quickly as possible. For a database server that’s also true, but you want to understand why a problem arose in the first place too. Consult a DBA, find the root cause and fix it in a manner that you don’t see it again. Don’t just jump to “Turn it off and turn it on again.”

SQL Agent, Maintenance Plans, Auto-Heals and How to Waste an Afternoon

1 Reply

At my company, we’re lucky to have a custom tool that monitors EVERYTHING happening in our SQL Server environment. But every now and then, I like to explore what’s possible using only SQL Server’s built-in features-just in case you don’t have a comprehensive monitoring suite at your disposal.

Recently, someone made an off the cuff comment about using TSQL Tasks in Maintenance Plans to handle more complex logic that SQL Agent handled out of the box. I was briefly excited by the prospect of building improved logic flows directly into SQL Server Maintenance Plans. This months Andy Levy hosts TSQL Tuesday and asks us about how we handle SQL Agent Jobs and this seemed like a great opportunity for me to share a story about how I wasted an afternoon testing a few components I have a passing knowledge with, attempting to implement auto-healing and conditional logic flow, only to snap out of it a few hours later when I realized that I was trying to solve a problem that someone else cracked nearly 2 decades ago.

The visual workflow designer and conditional branching on success or failure in Maintenance Plans seemed like a great way to automate not just routine tasks, but also smart responses to issues as they arise. I love any opportunity to implement automatic healing into anything that is likely to alert me. Getting a message saying “A thing nearly broke last night – the problem is fixed” is so much better than getting a page to fix something that retrying or a known remediation action needs to be manually performed. So I looked into a little side by side comparison of SQL Agent vs Mainteance Plans to address a scenario where a DBA performs the same boring checks and actions when a backup fails, and how both technologies allowed for implementing retries and logic flow to handle this automatically rather than have a human intervene.

The flow for this process was good at first, but as I dug deeper, I realized that while Maintenance Plans can be powerful, you can achieve the same (and often better) results with a little creative thinking in SQL Server Agent Jobs. In fact, Agent Jobs offer even more flexibility, allowing you to mix and match different step types-T-SQL, PowerShell, SSIS packages, and more-within a single job. Let me walk you through what I learned, with some real-world examples.

Maintenance Plans: Visual Logic, Limited Flexibility

Let’s start with the Maintenance Plan approach. Here’s a screenshot of a workflow I put together to handle database backups with built-in remediation and notifications but the same ideas can apply to ETL or other processes.

The plan tries to back up all user databases.
If the backup succeeds, it updates a central monitoring table.
If the backup fails, it runs some auto-remediation steps (like clearing disk space or restarting a service), then tries the backup again.
If the second attempt succeeds it updates the central repository, and if it fails, it notifies the on-call operator.

This is a solid approach, and the visual editor makes it easy to follow the logic. But there are some limitations:

You’re mostly limited to backup, maintenance, and T-SQL tasks.
Complex branching or looping is awkward.
Integrating with external scripts or tools isn’t straightforward.

Agent Jobs: The Real Powerhouse

What’s more, my inital enthusiasm around the workflows was misplaced. SQL Agent actually has success and failure branching built in also at a per step level, it’s just in most cases the failure action is to say the job failed. But it doesn’t need to be.

With Agent Jobs, you can:

Chain as many steps as you want, each with its own success/failure logic.
Use different step types (T-SQL, PowerShell, CmdExec, SSIS, etc.).
Easily implement loops, retries, and advanced remediation.
Integrate with external systems, scripts, or APIs.

Here’s a screenshot of a way of achieving a better outcome using a non-standard workflow with SQL Agent. The job again handles backups with layered remediation and targeted notifications:

Reset Autoheal Remediation Counter – Prepares the environment for a new backup attempt.
Check Retry Loop – Determines if a retry is needed based on previous attempts. If the number of retries is exceeded the job generates a failure which redirects to the final step(Which is configured to alert the oncall operator)
Backup User Databases – The main backup step.
Update Central Reporting Repository – On successful backup the job proceeds to this step which logs the result and ends the job.
Initiate Backup Drive Maintenance – If the backup fails, attempts to resolve likely issues.
Perform Additional AutoResolution Tasks – Further remediation if needed. Following the completion of this step the job is redirected back to step two which increments the loop counter and retries the backup.
Job Failed Notification – If all else fails, notifies the operator and ends the job.

Each step specifies exactly what should happen on success or failure, making the workflow both robust and transparent.

Why Agent Jobs Win

After experimenting with both approaches, I decided I hadn’t infact finally found a use for Maintenance Plans. Really the only thing they have going for them in this scenario is a nice graphic interface to visualise what is going on. That’s not to undersell the importance of a clearly identifiable workflow. The job step approach is very hard to follow if your workflows become more complicated. But here’s why I think Agent Jobs are the smarter choice for most advanced workflows:

Multiple Step Types – You’re not limited to T-SQL or maintenance tasks. Want to run a PowerShell script, call an external API, or execute a batch file as part of your remediation? No problem.
Granular Control – You can direct the job flow on a step-by-step basis, allowing for complex logic, conditional retries, and targeted notifications.
Easy Integration – Agent Jobs make it simple to integrate with external monitoring systems, send custom alerts, or update centralized logging repositories.
Maintainability – It’s easier to document, update, and troubleshoot jobs when each step is clearly defined and can be of any type you need.

Conclusion – So which should you use?

So obviously I’m leaving out some additioanl options in my side by side test. SQL Agent Jobs have auto retry built in at the per step level if you want, and the loop doesn’t have to be at the task or job step level – you can build that right into the TSQL if you like. I played around with both of these approaches for a while, trying to see if I could refine them to resolve the issues that I came up with. I really wanted that graphical format, but I also wanted the flexibility of the SQL Agent jobs. Which should I use?

When you delve into these things it’s really easy to get stuck with your blinkers on. The answer here was pretty clear. Microsoft actually built a whole tool for us to handle SQL data movement with complicated workflows and event handlers that has a flexible graphical presentation layer – and that tool is called Integration Services. Yes SSIS has some of it’s own issues but flexibility of logic flows is not one of them.

Another afternoon frittered away investigating a problem that was solved years ago….

TSQL Tuesday 184 – Mentoring, Sponsorship, Rubber Ducks and Mates

1 Reply

This month Deborah Melkin takes a turn as the host of TSQL Tuesday and asks some interesting questions about Mentorship and Sponsorship. When I started this blog it was mostly as a place to drop bits of code and technical how-to’s and largely has stayed away from personal development, leadership and career planning, largely because I’ve never considered myself any good at any of that stuff. And yet when I look back at my career and life(something I am doing far too much lately due to a big scary birthday looming in November…) it’s very obvious that certain relationships I have formed have been very influential on the way my career path has developed and where I am now.

For me, these relationships have been fairly organic, and I’ve largely never thought of them as Mentoring – rather an open exchange of challenges and solutions. While there are definitely people who have been in a more traditional mentor\mentoree relationship with me, I’d like to talk today about some friends in a less traditional ‘mentoring’ relationship. Actually – I want to talk about having lunch with your mates.

For over a decade, I have two friends who I will grab lunch every fortnight or so and the conversation flips fairly naturally between general friend chat, personal development and specific work issues we are facing. We are all far enough removed in the tech world not to directly influence each others careers, but close enough that the problems we encounter tend to be the same type of thing.

What’s interesting is that there’s no Mentor and Mentoree in these scenarios, just a group of guys who are hitting the same problems at different times and exchanging ideas on how to resolve them. Quite honestly I think any of us would be horrified to have the label ‘mentor’ slapped on us for these lunches, but they provide so many of the benefits.

I’ve heard mentorship described in a number of different ways, but one thing that has stuck with me is a discussion I had some years ago where I suggested someone had acted as a mentor in suggesting a course of action to resolve a problem. I was corrected, and told they had tutored me(given me a solution) rather than mentored me(empowered me to find the solution myself). This distinction is what I get from my regular lunch catchups. Sure – we give advice, but there’s never an expectation that the advice is any more than a suggested approach. As I say we are far enough apart in our job roles that our solutions are usually not specific implementations, but general approaches. The final step of taking the advice on board and acting on it always sits on the person with the problem, loosely ‘the mentoree’.

Furthermore, there is some accountability built in. My friends have a shared mindset when it comes to how much control we have in our lives. Specifically – we all are of the belief that we have the ability to make decisions and take actions and that the consequences are on us. None of us accept that we are stuck where we are, we all know that if we want to be doing something else we have the skills and experience to do it. As such if the same problem comes up two lunches in a row the answer is not usually a suggestion it’s a challenge – “What did you do to solve this since we last spoke about it?”

There is a developer technique known as ‘Rubber Duck Debugging’ where a problem should not be escalated to a senior resource before it has been explained to an inanimate object such as a rubber duck, or rock. The idea is that verbalizing the problem forces them to slow down and think through the steps and logic, which can often lead to discovering the solution. It’s a form of self-explanation that helps clarify complex ideas.

Lunch with trusted friends who are ready and willing to hear you explain your work problems is another form of Rubberducking. Often it’s just a case of taking the time to talk through something, and the solution presents itself. Other times the solution is already there, but you are just seeking the validation that you have reached the right conclusion.

I think there’s a key point in there somewhere about the role a mentor should play. They should be someone that you trust, someone that you respect, someone that has your best interests at heart and someone who will call you out if you are going off track. These are all attributes I look for in friends too.

I fully support the more traditional approach to Mentoring, and applaud the efforts that people like Deborah are making in helping people find these relationships through tools such as WITspiration, but I also encourage people to not get too hung up on the label of ‘Mentor’ and realize that mentoring doesn’t need to be a big formal arrangements, and that if you are prepared to open yourself up a little you can gain many of the benefits from a group of like minded friends.

It Depends

Rob's SQL Server and Technology blog