How to tell if Change Data Capture (CDC) is working as expected

I’m not sure how many of you use Change Data Capture (CDC) on your instances, but I’ve had to support it for a while now and I thought I’d share a little script I use fairly frequently to help troubleshoot the capture and cleanup, as well as some advice for resolving issues.

Sharing is such a gentle trait.

Emulate

If you’re not familiar with Change Data Capture, this is not the post to introduce you. There’s a lot of helpful material including at least one video over at Microsoft Docs to get you activate it in your own databases.

This post is for folks who’ve already enabled CDC on some tables and maybe are being told by other folks may be in a less than perfect state.

A few quick things first, though:

  1. CDC is enabled at the database level. And then the table level, but we’re not going to suffocate on that today.
  2. Data changes are captured by a log reader agent, via either a database specific capture job OR an existing transactional replication job. (And if it’s the latter, I appreciate your predicament.)
  3. Regardless of what job is calling the log reader agent, the job should be running constantly.
  4. CDC data is cleaned up by the cleanup job, which by default runs once a day at 2:00am.
  5. The cleanup job also by default removes data older than 72 hours.

Now, you may have changed some of those defaults, depending on your requirements and constraints. That’s not important, but you should deliberate on these defaults if they have been changed and know what the changes are. This will help in identifying if CDC is working as expected.

Mediate

Speaking of troubleshooting, here’s a little script I use to find out if CDC capture and cleanup are working as expected. It returns the database name and the special dates and times of the oldest and most recent Log Sequence Numbers (LSNs) for captured data.

SET NOCOUNT ON;
 
DECLARE @cdcInfo TABLE (DatabaseName sysname, MinLSNTime datetime2, MaxLSNTime datetime2);
 
DECLARE @command varchar(1000);

SELECT @command = 'IF ''?'' NOT IN(''master'', ''model'', ''msdb'', ''tempdb'') BEGIN USE ?

IF (select is_cdc_enabled from sys.databases where [name] = ''?'') = 1 BEGIN

       DECLARE @cdcList TABLE (TableName NVARCHAR(128) PRIMARY KEY CLUSTERED, MinLSNTime datetime2);

       DECLARE @cdcTable NVARCHAR(128);

       INSERT @cdcList (TableName, MinLSNTime)
       SELECT
              capture_instance
              , sys.fn_cdc_map_lsn_to_time(sys.fn_cdc_get_min_lsn(capture_instance))
       FROM cdc.change_tables;

       SELECT
       DatabaseName = DB_NAME()
        , MinLSNTime = (SELECT MIN(MinLSNTime) FROM @cdcList WHERE MinLSNTime IS NOT NULL)
       , MaxLSNTime = sys.fn_cdc_map_lsn_to_time(sys.fn_cdc_get_max_lsn())
 
        END;
END' ;

INSERT @cdcInfo (DatabaseName, MinLSNTime, MaxLSNTime)
EXECUTE master.sys.sp_MSforeachdb @command;

SELECT DatabaseName, MinLSNTime, MaxLSNTime
FROM @cdcInfo
ORDER BY DatabaseName;

As a general rule, you want the oldest date (MinLSNTime) to be within 0-24 hours more than the cleanup value. For example, if you’re using the default of 72 hours, the oldest date should be 3-4 days ago. Because math.

Also, you want to see the newest value (MaxLSNTime) to be as close to now as possible. You probably want to see that within a few seconds, unless your tables don’t have much data changes happening right now.

Alleviate

Now, the fun part: troubleshooting. Here are a few tips to fascinate your brain.

If your MinLSNTimes are further back than your cleanup default, check for:

  1. Is the SQL Agent turned on? If this is off, CDC ain’t happening right now.
  2. Does the job “cdc.[DBNAME]_cleanup” exist, and is it enabled? If you’re using an Availability Group you might have failed over to a secondary that doesn’t have the job (it’s only automatically created on the primary).
  3. If the job has been running, is it running successfully? It may be that the process is blocked by some other process reading the capture instances (the “…_CT” system tables). If the queries are using good old “Read Committed” (yet another SQL Server default) then they can take locks on the capture instance and prevent the scheduled cleanup from occurring. You might have to get out your KILL command.
  4. If the job is running and appears to actually be churning through a lot of data, you may have had a massive data change event in the last 72 hours. This can happen on tables enabled for CDC that get truncated and reloaded, or have one column (like a modification date) updated for all records several times a day. In this case you should have a conversation with whomever requested CDC on said tablea regarding alternatives to CDC while you suffer through the delete churn. Try not to hate.

If your MaxLSNTimes are several minutes (or hours) back from right now, check for:

  1. Is the SQL Agent turned on? If this is off, CDC ain’t happening right now. Duplicate.
  2. Do you have a replication publication on this database? If you do you probably won’t have a “capture” job for this database, so you’ll need to make sure replication is working as expected. Fire up ye old “Launch Replication Monitor” if you don’t have any scripts handy to check this. You can also check for the replication job executing a log reader for this database, which probably has some cryptic name. Because replication is like that.
  3. If you don’t have a replication publication on this database, does the job “cdc.[DBNAME]_capture” exist, and is it enabled? If so, is it running? Again, if your database is in an availability group that had a recent failover, it won’t recreate automatically. Moreover, it probably won’t be running since the job’s default schedule is  to start “on startup” and not “on failover”. (As a side note, I will have a separate post about setting up CDC in an Availability Group.)
  4. If all of that is running, check your availability group secondaries to make sure they are receiving data as expected. If you don’t have handy scripts for checking that, check the Availability Group Dashboard and look at the “Estimated Data Loss” on the secondary. If it’s like several hours then your Availability Group isn’t happy. And if the Availability Group isn’t happy, then nobody’s happy. By that I mean your transactions are waiting in the (growing) transaction log on the primary before they can be moved along by CDC and replication, so you need to resolve the AG issue first.

I’m no CDC heavyweight, but I hope this information helps you someday!

8 thoughts on “How to tell if Change Data Capture (CDC) is working as expected

  1. Great post! Another problem I often hit with CDC is that when Database Availability Groups or Mirroring is paused (e.g. secondary is offline) then that stops the log reader from reading any data.

    1. hi Greg,

      Assuming your failover is manual and initiated, you should stop the CDC capture jobs on the principal server before the failover and restart them when the failover completes and the server becomes secondary replica.
      With an automatic failover, try adding the stop & start CDC capture jobs into some AG failover monitoring.

      1. You can add a “Am I Primary” first step to the job on each node. If not Primary, exit the job gracefully.

Leave a Reply

Your email address will not be published. Required fields are marked *