Batching data manipulation is great as long as you do it correctly

As Sinatra was fond of singing, “regrets, I’ve had a few.” And one of those came this week immediately after my presentation. This isn’t to say the presentation didn’t go well – it did. At least I think it did. I had several people come up and thank me afterwards for showing them things they didn’t know previously, so in that sense it went well.

But in hindsight, one part of it, well, it sucked. So I’m going to take a moment here and now to correct that.

My presentation was about ways to improve performance for data manipulation, and one of the tips that I mentioned was breaking a statement into multiple “batches” by using WHILE and TOP. This can be helpful not only because the batched queries consume less resources like memory grants, but also because these smaller transactions will hold locks for less time, meaning they won’t be as obstructive to concurrent users. That’s win-win optimization, right?

Well, yeah, except when you talk about it but don’t demonstrate it correctly. My friend, I’ll say it clear: My demo was wrong.

To the demo scripts. Let’s do what we have to do.

First, let’s load 1,000,000 records into a table with a clustered primary key Id column and a second int column, SomeId.

SET NOCOUNT ON;

CREATE TABLE SomeTable (Id int IDENTITY (1,1) PRIMARY KEY CLUSTERED, SomeId int);

DECLARE @id int = 1;

BEGIN TRAN;

WHILE @id <= 1000000 BEGIN
 INSERT SomeTable (SomeId)
 SELECT 0;

 SET @id += 1;
END;

COMMIT;

Next, let’s update 90% of the records the basic way, as one single batch.

UPDATE SomeTable
SET SomeId = 1
WHERE Id > 100000;

That's pretty fast, but if we were updating wider records, and several hundred million of them, it would take considerably more time. Maybe your hardware can't handle a single statement of that size at all, and it runs for what feels like infinity. That's what we're trying to avoid.
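If you'd rather see numbers than take my word for it, SQL Server's session statistics will report page reads and timings for each statement. This is just a sketch that wraps the same update as above; run it in SSMS and check the Messages tab.

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Same single-batch update as above; the Messages tab will show
-- logical reads and CPU/elapsed time for the statement.
UPDATE SomeTable
SET SomeId = 1
WHERE Id > 100000;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;

You can wrap any of the batched versions below the same way to compare total reads between approaches.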

So next let’s update the same 90% with a simple method of batching, the way I showed in my slides and demo. This is intended to take about the same amount of time, but it most certainly does not.

DECLARE @RowCount int = 1;

WHILE (@RowCount > 0) BEGIN
 UPDATE TOP (20000) SomeTable
 SET SomeId = 2
 WHERE Id > 100000
  AND SomeId = 1;

 SET @RowCount = @@ROWCOUNT;
END;

Here’s the problem: the whole point of batching is to use fewer resources and to avoid obstructing concurrent users. The script above that I used in my batch demo does THE EXACT OPPOSITE of that. Here’s why: if we script UPDATE TOP (20000) in a WHILE loop the way I showed, then we are basically telling SQL Server to scan 20,000 records, update them, then scan again for the next 20,000 records and (here’s the important screw-up in my demo) WE ARE STARTING EACH BATCH SCAN FROM THE SAME PLACE. This means each scan for the next 20,000 records has to re-read all of the data pages in the index that were already read by the previous scans.
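To put back-of-the-envelope numbers on the demo (ignoring the initial seek past Id <= 100000, and assuming scan cost is roughly proportional to rows visited): 900,000 qualifying rows in batches of 20,000 means 45 batches, and batch number k has to read past the roughly 20,000 × (k − 1) already-updated rows before it finds fresh work. That totals about 20,000 × (1 + 2 + … + 45) = 20,000 × 1,035 = 20.7 million rows visited, versus 900,000 for the single-statement version. The work grows quadratically with the number of batches, not linearly.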

Put simply, it’s biting off more than it can chew. This is pretty much the opposite of query optimization.

It’s horribly inefficient because the very last scan will effectively read ALL of the pages into the buffer cache just like the basic query. With a larger amount of records we could run out of memory and spill to tempdb. And maybe even fill tempdb up. And not in a shy way.

And yet, this is totally avoidable. We just have to track the starting and ending point of each careful step along the byway of the batch in the WHILE loop using a Min and Max variable. These variables will help our query be precise about which records it is scanning in the main table, so we’re going to match these variables against values in the clustered index of the table we are updating. (Or deleting, as Michael J Swart showed in an example a few years ago.)

Here is how I should have demonstrated this in my presentation.

DECLARE
 @BatchIdMin int = 100000
 , @BatchIdMax int
 , @RowCount int = 1;

WHILE (@RowCount > 0) BEGIN
 -- Find the upper bound of the next 20,000-row batch.
 -- (MAX over a TOP subquery avoids relying on multi-row
 -- variable assignment order, which isn't guaranteed.)
 SELECT @BatchIdMax = MAX(Id)
 FROM (
  SELECT TOP (20000) Id
  FROM SomeTable
  WHERE Id > @BatchIdMin
  ORDER BY Id
 ) AS Batch;

 UPDATE SomeTable
 SET SomeId = 3
 WHERE Id > @BatchIdMin
  AND Id <= @BatchIdMax;

 SET @RowCount = @@ROWCOUNT;

 -- The next batch starts where this one ended.
 SET @BatchIdMin = @BatchIdMax;
END;

If you run this demo yourself you’ll see it takes roughly the same amount of time as the original query. But now you don’t have to worry about blowing up the buffer cache or blocking other queries against this table for any number of minutes, hours, or whatever. This – THIS – is actual query optimization.
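The same min/max pattern works for deletes, which is the Swart-style use case mentioned earlier. Here’s a sketch against the same demo table; the batch size and the starting bound are assumptions you’d tune for your own workload.

DECLARE
 @BatchIdMin int = 100000
 , @BatchIdMax int
 , @RowCount int = 1;

WHILE (@RowCount > 0) BEGIN
 -- Upper bound of the next batch of ids.
 SELECT @BatchIdMax = MAX(Id)
 FROM (
  SELECT TOP (20000) Id
  FROM SomeTable
  WHERE Id > @BatchIdMin
  ORDER BY Id
 ) AS Batch;

 -- Delete only the id range this batch owns; when no ids remain,
 -- @BatchIdMax is NULL, zero rows match, and the loop exits.
 DELETE SomeTable
 WHERE Id > @BatchIdMin
  AND Id <= @BatchIdMax;

 SET @RowCount = @@ROWCOUNT;
 SET @BatchIdMin = @BatchIdMax;
END;

Each DELETE commits on its own, so locks are released between batches just like in the update version.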

I still can’t believe I messed this up. Maybe when the tears subside, I’ll find it all so amusing.

One thought on “Batching data manipulation is great as long as you do it correctly”

  1. Oof, I did something similar to myself earlier this year with the fun added flavor of an in-memory table! I budgeted what should have been room for a 10x spike in normal throughput… only to discover that I’d written for myself an N+1 style bug that ended in recurring OOM events (and tears).
