How one SELECT can affect the whole instance

So much of being a DBA revolves around responding to the phrase “The database is slow.” If I had a dollar for every time I heard that phrase I could probably buy that unreleased Wu-Tang Clan album. Not that I’d want to. But I could. I think. Wait… don’t leave. This isn’t a post about rap music, I swear.

Whenever I hear those four troubling words I am compelled to spring into action like Jackie Chiles in order to quickly defend my client, the SQL Server instance. I check for troublesome connections with sp_whoisactive. I check the Error Log for deadlocks, restarts, etc. I look at Performance Monitor on the instance to review CPU and Memory usage, then check the Network and Disk metrics to see if I can pinpoint a cause that isn’t the database. If there’s an Availability Group I’ll check that, and maybe set up a trace for something else.
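In case you’re curious, those first couple of checks usually look something like this. sp_WhoIsActive is Adam Machanic’s free procedure, and the parameter names here assume a reasonably recent version of it:

EXEC dbo.sp_WhoIsActive
    @get_plans = 1,           -- pull the estimated plan for each active request
    @find_block_leaders = 1;  -- flag the session at the head of any blocking chain

-- skim the current error log for deadlocks, restarts, failed logins, etc.
EXEC sp_readerrorlog 0, 1;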

I’m gonna stop now before this turns into a different post about troubleshooting.

The point is there are lots of tools that let you say “Your honor, my client is innocent of these ‘slow’ charges.” But sometimes, it really is the database that’s slow. And that makes the panda sad.

Many times this “slowness” is actually caused by delays that result from ad-hoc queries by users with good intentions and, um, let’s nicely say underdeveloped T-SQL skills. Typically they write a SELECT statement that somehow joins 15 tables. Or they don’t have a predicate to filter the results. Or they select all columns with the ubiquitous “SELECT *” syntax. Or maybe they did all of that.

Now, if you have adequate hardware and have configured your settings sensibly (think Max Memory NOT left at the default) then all of those issues are usually going to be limited in scope to only the database(s) containing objects in the query. But recently I saw a SELECT statement that managed to halt the majority of the queries for the entire instance.
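(Quick aside: checking that Max Memory setting takes about ten seconds. The 25600 MB below is a placeholder, not a recommendation; pick a value that leaves headroom for the OS.)

EXEC sys.sp_configure N'show advanced options', 1;
RECONFIGURE;

-- cap SQL Server's memory usage so everything else on the box keeps breathing
EXEC sys.sp_configure N'max server memory (MB)', 25600;
RECONFIGURE;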

I gotta admit, I was a little in awe. I mean, as queries go this was quite the weapon of mass disruption.

The query was in a proprietary database, so I can’t show you the actual code, but the gist of it was something like this:

SELECT Col1, Col2, Col3, ...Col50
FROM TableWithHundredsOfMillionsOfRecords
ORDER BY
    LTRIM(RTRIM(Col2)) COLLATE SQL_Latin1_General_CP1_CS_AS,
    LTRIM(RTRIM(Col3)) COLLATE SQL_Latin1_General_CP1_CS_AS,
    LTRIM(RTRIM(Col4)) COLLATE SQL_Latin1_General_CP1_CS_AS,
    ...LTRIM(RTRIM(Col50)) COLLATE SQL_Latin1_General_CP1_CS_AS

It was bad enough that this query was trying to read a zillion pages of data, but the real killer was the ORDER BY. You see, when you SELECT all that data and choose to order the results in a way counter to the clustered index, your execution plan has to create a Sort operator to (of course) sort those results. And when the sheer number of pages is combined with all of the nonsense going on in the ORDER BY, that sort couldn’t be done in memory – it had to spill to tempdb.
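If you want to catch these spills as they happen, newer versions of SQL Server expose a sort_warning Extended Event. This is just a sketch – it assumes your version has that event, and the session name and file target are arbitrary:

CREATE EVENT SESSION sort_spills ON SERVER
ADD EVENT sqlserver.sort_warning (
    ACTION (sqlserver.session_id, sqlserver.sql_text))
ADD TARGET package0.event_file (SET filename = N'sort_spills');

ALTER EVENT SESSION sort_spills ON SERVER STATE = START;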

Now, here’s where it gets fun. Actually, the opposite of fun, because the tempdb data files for this server exist on their own drive with about 200 GB of space that never comes close to being used. Except with this query it did get used, and tempdb filled up in the middle of the sort, which meant tempdb was unusable for any other query that needed it. Everyone was stuck in traffic and the off ramps were closed.
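If you ever want to see how bad things are in tempdb, a quick peek at sys.dm_db_file_space_usage shows how much room is left and which consumer is hogging it (page counts converted to MB):

SELECT
    SUM(unallocated_extent_page_count) * 8 / 1024       AS free_space_mb,
    SUM(internal_object_reserved_page_count) * 8 / 1024 AS internal_objects_mb,  -- sort/hash spills, work tables
    SUM(user_object_reserved_page_count) * 8 / 1024     AS user_objects_mb,      -- temp tables, table variables
    SUM(version_store_reserved_page_count) * 8 / 1024   AS version_store_mb
FROM tempdb.sys.dm_db_file_space_usage;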

Important reminder: queries from any database can create objects in tempdb. With tempdb in this state, that meant no new temp tables, table variables, or anything else that needs tempdb. And just like that, I had a whole instance full of “the database is slow” messages. Good times.
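And when it gets to that point, the next question is who is doing all of that allocating. Something like this sketch, run from tempdb, points at the offender so you know which session to go after:

USE tempdb;

SELECT TOP (10)
    t.session_id,
    t.internal_objects_alloc_page_count,  -- sort/hash spill allocations
    t.user_objects_alloc_page_count,      -- temp table / table variable allocations
    s.text AS running_sql
FROM sys.dm_db_task_space_usage AS t
JOIN sys.dm_exec_requests AS r
    ON r.session_id = t.session_id
   AND r.request_id = t.request_id
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS s
ORDER BY t.internal_objects_alloc_page_count DESC;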

By the way, this is one of the reasons why you don’t put every data file, log file, system file, and tempdb file on the same drive. Because then we’d have really been over a barrel.

If you want to know the solution, it was basically (1) kill the query, (2) create another tempdb data file on another drive so other queries could finish, (3) shrink the files to reclaim free space on the drive once the current queries completed, then (4) EMPTY and DROP that extra file.
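Roughly, steps (1) through (4) look something like this in T-SQL. The session id, file names, paths, and sizes are all placeholders for the sketch; the real ones obviously depend on your drives:

-- (1) kill the offending session
KILL 85;

-- (2) add an emergency tempdb data file on a drive that still has room
ALTER DATABASE tempdb
ADD FILE (NAME = N'tempdev_emergency',
          FILENAME = N'E:\TempDBRelief\tempdev_emergency.ndf',
          SIZE = 10GB);

-- (3) once the in-flight queries finish, shrink the original files to reclaim space
USE tempdb;
DBCC SHRINKFILE (N'tempdev', 20480);  -- target size in MB

-- (4) empty the emergency file and remove it
DBCC SHRINKFILE (N'tempdev_emergency', EMPTYFILE);
ALTER DATABASE tempdb REMOVE FILE tempdev_emergency;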

Then move to (5) find who wrote that query and “Bring Da Ruckus” Wu-Tang style!

OK, not really. I will give them some, uh, constructive feedback on better T-SQL practices, but I’ll do it gently since none of us were born knowing this stuff.

5 thoughts on “How one SELECT can affect the whole instance”

  1. Isn’t it absurd that one user can cause so much trouble?
    How can this be prevented, not just dealt with? For example, can SQL Server limit tempdb allocation by user or query?

    1. Query Governor might limit the effects of this kind of monster query, but we aren’t going to use it for this instance because there are legitimate queries that have pretty large costs and yet still don’t blow up the tempdb data files. Think large monthly report type of queries that run in the wee hours.

      It might work for others though, so that’s a good suggestion. I generally favor a warning with an explanation for the first offense, and revoking permissions on the second one.
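      For anyone who wants to try it, the knob itself looks roughly like this – the 300 cost ceiling is just a placeholder, and 0 turns the limit back off:

      EXEC sys.sp_configure N'show advanced options', 1;
      RECONFIGURE;

      -- instance-wide ceiling on estimated query cost
      EXEC sys.sp_configure N'query governor cost limit', 300;
      RECONFIGURE;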

  2. The tip to put different assets on their own disk is invaluable. We saw a very noticeable uptick in performance when doing this for our data warehouse. We then put a particularly heavily used schema in its own partition and drive, resulting in even better performance.
