If you’ve ever had to implement Change Data Capture (CDC) for a database in an Availability Group, then you know that the CDC jobs don’t really consider the Availability Group. The capture and cleanup jobs created are set up as if the database exists only on a single instance.
And that’s a problem, because I would guess quite a lot of databases are in Availability Groups. Maybe even some of yours. If you have this issue, I’ve put together a step-by-step solution in this post.
I know this post is a little long compared to others here, but I’ve tried to be thorough to save you headaches.
The Microsoft way
According to the documentation, the answer to getting CDC jobs working in an availability group is roughly to do these steps:
- Enable CDC on a table on the current Primary, which creates the capture and cleanup jobs
- Initiate a failover to the current Secondary
- Execute sys.sp_cdc_add_job to create the capture and cleanup jobs on the new Primary
- Manually disable the capture and cleanup jobs on the new Secondary (a.k.a. the previous Primary)
This will in fact create the jobs on both instances, but in my experience I’ve found a few issues with setting up the jobs this way.
Issue #1 – the capture and cleanup jobs can’t be created on a Secondary this way. This is because sys.sp_cdc_add_job only works on the Primary instance; if you execute it on a Secondary it will give you an error. The designed way to create the jobs on the current Secondary is to wait until you fail over, presumably manually.
Issue #2 – the capture job doesn’t start on the new Primary after a failover. This is because when the capture job is created it only has a schedule to start “on startup,” and since your new Primary was already running during the failover, you have to start the capture job manually.
Issue #3 – the capture job doesn’t stop on the new Secondary after failover. It will attempt to capture changes on what is now a read-only database, which will give you lots of unsuccessful attempts at execution. You have to stop and disable this job manually.
Issue #4 – the cleanup job on the new Secondary is still enabled after failover. Like the capture job, it will attempt to run against what is now a read-only database, which will not work. You have to disable this job manually.
Issue #5 – all of this can only work if your failover is done manually. If you have your Availability Group set to synchronous with auto failover, none of this gets done if an auto failover occurs. That means captured change data is sitting in your transaction log until you turn on the capture job on the new Primary…manually, of course.
That’s a lot of work done manually in a setup designed to be automatically highly available.
A higher availability way
Now, I fully admit the steps below aren’t the most elegant way to resolve this, but they will get you set up in a way such that you can sleep easier at night. And besides, they’re mostly copy/paste.
And if you’ve already attempted to set up the jobs on your Primary and Secondary/Secondaries, you don’t need to run all these steps. In fact, let me say that a bit louder.
IMPORTANT: If you already have capture and cleanup jobs on your current Primary and Secondaries, you may only need Steps 1 & 2 to make your CDC jobs highly available. All the steps are included in case you are enabling CDC on a database in an availability group for the first time.
Oh, and please be aware of “DatabaseName” in the steps below. These will need to be changed to the correct database name, unless you really do have a database named “DatabaseName”. But you don’t. Probably.
Ok, on to a better solution.
Step by step instructions
Step 1 – Update the job steps to be AAG aware on the Primary
First, we’re going to make sure our existing jobs are aware of whether they are executing on the Primary. We’re using the helpful function sys.fn_hadr_is_primary_replica in each job step to determine if this database is on the Primary; if it isn’t, the step doesn’t try to do anything else.
Using our good friend copy/paste and SQL Server Management Studio (SSMS), let’s go to the SQL Agent and change the specific “cdc.[DatabaseName]_capture” job steps (if it exists; it might not if there is a replication publication) to the following:
Job Step 1 – Starting Change Data Capture Collection Agent
DECLARE @DatabaseName SYSNAME = DB_NAME()
IF (SELECT sys.fn_hadr_is_primary_replica (@DatabaseName) ) = 1
RAISERROR(22801, 10, -1)
Job Step 2 – Change Data Capture Collection Agent
DECLARE @DatabaseName SYSNAME = DB_NAME()
IF (SELECT sys.fn_hadr_is_primary_replica (@DatabaseName) ) = 1
EXEC sys.sp_MScdc_capture_job
…and then change the step of the “cdc.[DatabaseName]_cleanup” job as well.
Job Step 1 – Change Data Capture Cleanup Agent
DECLARE @DatabaseName SYSNAME = DB_NAME()
IF (SELECT sys.fn_hadr_is_primary_replica (@DatabaseName) ) = 1
EXEC sys.sp_MScdc_cleanup_job
Step 2 – Create an additional schedule on the capture job for every minute on the Primary
Remember, the capture job only has a schedule to start “on startup,” which won’t do anything if a failover occurs while the instance is up and running.
Because we want CDC to capture changes without much delay, we’re going to add this additional schedule to the capture job so it tries to start every minute. If you think that’s too aggressive you can adjust it to whatever interval you want.
DECLARE
@Database SYSNAME = 'DatabaseName' /* Your Database name goes here */
, @job_id UNIQUEIDENTIFIER
SELECT @job_id = job_id
FROM msdb.dbo.sysjobs
WHERE [name] = 'cdc.' + @Database + '_capture'
EXEC msdb.dbo.sp_add_jobschedule
@job_id = @job_id
, @name=N'Start in case this becomes Primary'
, @enabled=1
, @freq_type=4
, @freq_interval=1
, @freq_subday_type=4
, @freq_subday_interval=1
, @freq_relative_interval=0
, @freq_recurrence_factor=1
, @active_start_date=20010101
, @active_end_date=99991231
, @active_start_time=0
, @active_end_time=235959;
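If you’d like to confirm the new schedule actually attached to the capture job, here’s an optional sanity check (it assumes the default cdc.[DatabaseName]_capture job naming used above):

```sql
-- Optional check: list the schedules attached to the CDC capture job
SELECT j.name AS job_name
     , s.name AS schedule_name
     , s.enabled
FROM msdb.dbo.sysjobs AS j
JOIN msdb.dbo.sysjobschedules AS js ON js.job_id = j.job_id
JOIN msdb.dbo.sysschedules AS s ON s.schedule_id = js.schedule_id
WHERE j.name LIKE 'cdc.%[_]capture'; /* [_] escapes the LIKE wildcard */
```

You should see two rows per capture job: the original start-on-startup schedule and the new every-minute schedule.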
Step 3 – Create the cdc_jobs table on the Secondary (if necessary)
Have you ever wanted to create a system table? This could be your big chance. The CDC jobs need the table msdb.dbo.cdc_jobs to execute, and this table only gets created when the jobs are created by enabling CDC on the first table in a database, or by using sys.sp_cdc_add_job. And we’re not doing either of those here.
IMPORTANT: if this IS NOT the first CDC job on a Secondary instance, then this step will not be necessary.
Conversely, if this IS the first CDC job on a Secondary instance, then execute the following there.
CREATE TABLE [dbo].[cdc_jobs] (
[database_id] [int] NOT NULL
, [job_type] [nvarchar](20) NOT NULL
, [job_id] [uniqueidentifier] NULL
, [maxtrans] [int] NULL
, [maxscans] [int] NULL
, [continuous] [bit] NULL
, [pollinginterval] [bigint] NULL
, [retention] [bigint] NULL
, [threshold] [bigint] NULL
, PRIMARY KEY CLUSTERED (
[database_id] ASC
, [job_type] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY];
Step 4 – Script the “cleanup” and “capture” jobs from the Primary and execute on the Secondary
No T-SQL required here. In SSMS go back to the SQL Agent on the Primary and right-click on the job(s), then select “Script Job as” -> “CREATE to” -> “Clipboard” and paste into a connection to the Secondary. You’re going to execute in just a moment, but before you do there’s a quick warning.
IMPORTANT: before executing, search for any occurrences of the Primary instance name and replace them with the Secondary instance name.
After that, execute to create the jobs.
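Before moving on, a quick optional check on the Secondary confirms both jobs made it over:

```sql
-- On the Secondary: confirm both CDC jobs now exist
SELECT name, enabled, date_created
FROM msdb.dbo.sysjobs
WHERE name LIKE 'cdc.%[_]capture'
   OR name LIKE 'cdc.%[_]cleanup';
```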
Step 5 – Prepare the database specific records for msdb.dbo.cdc_jobs on the Primary
Now that we have the cdc_jobs table, we need to populate it with records relating to the cleanup and capture jobs. If we don’t then they won’t work. Because rules. And because there are parameters in the table related to the execution of the jobs.
Execute this on the Primary; it generates a small INSERT script, which will be copied and executed on the Secondary in the next step.
DECLARE
@Database NVARCHAR(255) = 'DatabaseName' /* Your DATABASE NAME goes here */
, @Fields NVARCHAR(200) = '[job_type], [job_id], [maxtrans], [maxscans], [continuous], [pollinginterval], [retention], [threshold]'
, @SQL NVARCHAR(1000);
SET @SQL = 'SET NOCOUNT ON;
DECLARE @Insert VARCHAR(MAX);
SELECT @Insert = ISNULL(@Insert + '' UNION '', ''INSERT INTO msdb.dbo.cdc_jobs([database_id], '
+ @Fields + ')'') + CHAR(13) + CHAR(10) + ''SELECT DB_ID(''''' + @Database + '''''),'' + '
+ REPLACE(REPLACE(REPLACE(@Fields, ',', ' + '', '' + '), '[', ''''''''' + CAST('),']',' AS VARCHAR(max)) + ''''''''')
+ ' FROM msdb.dbo.cdc_jobs WHERE database_id = DB_ID(''' + @Database + ''');
PRINT @Insert';
EXEC sp_executesql @SQL;
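If you want to see the source rows the script above reads before trusting its output, an optional query against cdc_jobs on the Primary will show them (replace the database name as usual):

```sql
-- On the Primary: inspect the rows the generated INSERT will be built from
SELECT database_id, job_type, job_id, maxtrans, maxscans,
       continuous, pollinginterval, retention, threshold
FROM msdb.dbo.cdc_jobs
WHERE database_id = DB_ID('DatabaseName'); /* Your database name goes here */
```

You should see one row for the capture job and one for the cleanup job.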
Step 6 – Execute the output of Step 5 on the Secondary
No suspense here since we said what was going to happen in the last step. Just take the copied output (an INSERT statement) from the last step and execute it on the Secondary.
Step 7 – Synchronize the Job IDs on the Secondary
One last thing here. When we scripted the jobs in Step 4 they created identical jobs on the Secondaries…mostly. Unfortunately, the scripts generated their own job_id values for the jobs, which do not match the job_id values we just inserted on the Secondary in Step 6. Sad frowny face.
The good news is that this is easily correctable. The better news is that this is the last step. Execute this script on the Secondary to synchronize the job_id values.
UPDATE c
SET c.job_id = s.job_id
FROM msdb.dbo.sysjobs s
JOIN msdb.dbo.cdc_jobs c
ON s.name = 'cdc.' + DB_NAME(c.database_id) + '_' + c.job_type
WHERE DB_NAME(c.database_id) = 'DatabaseName';
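To verify the synchronization worked, this optional query should return one row per CDC job with matched = 1:

```sql
-- On the Secondary: confirm cdc_jobs.job_id now matches the Agent job's job_id
SELECT j.name
     , CASE WHEN j.job_id = c.job_id THEN 1 ELSE 0 END AS matched
FROM msdb.dbo.sysjobs AS j
JOIN msdb.dbo.cdc_jobs AS c
    ON j.name = 'cdc.' + DB_NAME(c.database_id) + '_' + c.job_type
WHERE DB_NAME(c.database_id) = 'DatabaseName'; /* Your database name goes here */
```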
That was easy
OK, so now your CDC capture and cleanup jobs should be ready for the database to failover, manually or automatically. I know this isn’t a simple process, but it’s the best I’ve found. And yes, I know some of this can be done in PowerShell, but I decided against using that for any readers who might not want to go that route.
And if you’ve read this far, thank you for your endurance. Hope this helps!
Entry Time 2022-07-29 05:03:00.073, Msg 22903, Level 16, State 1, Another connection with session ID 85 is already running ‘sp_replcmds’ for Change Data Capture in the current database. For more information, query the sys.dm_cdc_errors dynamic management view.
I’m getting this error with this method. How do I stop this stored procedure when not on the Primary?
Hi, I just did this on SQL 2019 exactly as described, and I ran into the capture job failing with Invalid object name ‘msdb.dbo.cdc_jobs_view’, so I created it as:
CREATE VIEW dbo.cdc_jobs_view
AS
SELECT [database_id], [job_type], [job_id], [maxtrans], [maxscans], [continuous], [pollinginterval], [retention], [threshold]
FROM [dbo].[cdc_jobs];
Also, I created an alert to trigger on AG state change on both servers:
EXEC msdb.dbo.sp_add_alert @name = N'AG change state',
@message_id = 19406,
@severity = 0,
@enabled = 1,
@delay_between_responses = 0,
@include_event_description_in = 1,
@notification_message = N'Availability Group has changed. Attempting to start cdc.db_capture job. Please check it started. Sending email.',
@job_name = N'cdc.db_capture'
GO
I think this worked still going through my test results.
Sorry, the view definition should have this as the first field: DISTINCT DB_ID(s.database_name) AS database_id
Also, the way the jobs work on the secondary: the job retries step 1 ten times, then retries step 2 ten times, then fails, throwing a “job failed” email.
I am going to rip the AG check out of the steps that do the work and create a new step 1 that does the AG check and fails on the secondary, with that step set to go to the next step on success and quit reporting success on failure.
That way the job is not trying to do the capture on the secondary and I do not get false-positive “job failed” emails.
Hi Jeff & Ed, did anyone get this to work?
Also, can anyone comment about disabling?
After disabling: “…If Change Data Capture needs to be disabled on a database that is part of an Always On Availability Group, then you will need to perform more steps to ensure that log truncation is not affected.”
Quote is from https://learn.microsoft.com/en-us/sql/database-engine/availability-groups/windows/replicate-track-change-data-capture-always-on-availability?view=sql-server-ver16
This worked for me…kind of. What I am running into is that when a failover happens, the agent job on the now-Secondary server tries to run its code, but because the job step says to run on DB X and X is no longer available on the secondary, the job ends up running against DB X on the Primary.
I’ve verified this by making a job step that simply outputs @@ServerName to the log and when I run the job on the secondary, the output is the name of the Primary Server, but if I change the DB in the agent step to Master and re-run the job, it outputs the name of the Secondary server.
So, what this means for me is that the job on the secondary runs, but because it’s hitting the Primary, it identifies as being on the primary server and runs, so then the Agent Job on the Primary fails with the error that priyanshi noted previously (that another session is running ‘sp_replcmds’) and goes into retry mode.
I don’t know how to resolve this other than to try to create a job that will determine if a failover has occurred and disable the jobs on the secondary and activate them on the primary.
If anyone has also encountered this behavior and has a solution to make Jeff’s solution work, I’d be interested in hearing how you fixed this.