How to use Change Data Capture (CDC)
Often there are lot of changes made to the sources tables and you would like to have a history of all changes. First thing that comes into mind when you want to save all the changes (inserts/updates/deletes) to a table is triggers, or you may have to write complex stored procs.
What is Change Data Capture?
Change data capture (CDC) is the process of capturing changes made at the data source and applying them throughout the enterprise. CDC minimizes the resources required for ETL ( extract, transform, load ) processes because it only deals with data changes. The goal of CDC is to ensure data synchronicity.
A Data Warehouse (DWH) must maintain the history of business measure changes. Thus, the ETL processes of the Data Warehouse loading must be able to detect the data changes which have occurred in source operation systems during the business operations.
Inserting new records, updating one or more fields of existing records, deleting records are the types of changes which Change Data Capture processes must detect in the source system.
Let’s start working on how to use CDC.
STEP 1: DB Set up
Source and Target Tables: I have a source table which has 6 records and an empty destination table.
We need to enable CDC on the entire Database by executing sys.sp_cdc_enable_db
To list all the databases that are CDC enabled we can use the below query:
SELECT name, database_id, is_cdc_enabled FROM sys.databases
WHERE is_cdc_enabled = 1
We need to create a table that can hold the current state for various objects.
This table will be used to track the CDC load information, so that you only pick up recent changes each time the incremental load package is run. It will be created in the Destination database. [CDC_States]
I have created unique non-clustered index on [CDC_States] to quickly look a state.
We need to enable CDC on the source table by executing
@source_schema = N'Source',
@source_name = '07_CDC',
@supports_net_changes = 1
Now we are all Set with DB set up and can move to SSIS
We actually need 2 packages to perform the CDC, First package mainly does the initial Load.
STEP 2: CDC_InitialLoad
We will start with CDC Control Task, drag it to control flow and edit it. Add a new ADO.NET connection manager for the Source database. Set CDC Control Operation to Mark initial load start. This is how CDC knows this is initial load.
Then Create a new package variable (CDC_State) to hold the CDC state information
All the information in the CDC_State is stored in the CDC_States table we created in the DB by mapping the variable to the table.
Then The next task be a simple DFT task to load data from source to destination.
Once we are done with DFT task, we have to mark CDC end.
Finally, our Initial Load package would look like below
Once we run this package we would have the destination Table loaded with the data.
STEP 3: Update the Source Table
Now, we update/insert/delete records from the source table.
STEP 4: Incremental Load
As we already have initial data in the target table our goal is to get only changed records.
Incremental Load package has
1.Exec SQL task: create 3 staging tables to capture the inserts updates and deletes. The staging tables should have the same structure as destination table.
2.CDC Control Task: This is very similar to the CDC task created in Initial load but we select Get Processing Range CDC Control Operation.
This is mainly selected to ensure SSIS to pick only changed records.
3.DFT Task to load all the inserts/updates/deletes:
3.1 CDC Source: Here we select the CDC Enabled Source table and specify CDC processing Mode.
CDC Processing Mode Net returns only the most recent change. CDC processing Mode All returns all the changes that have been performed on a record.
3.2 CDC Splitter:
CDC splitter just takes the input from the source and splits the inserts/updates and deletes. If you check the Input/Output parameter if has InsertOutput, DeleteOutput, UpdateOutput.
We drag 3 destinations for Inserts, Updates and deletes. Select the staging tables created in the exec sql task.
Going back to Control Flow.
4 Sequence Container:
Create a Sequence Container to have 3 Execute sql Tasks. Execute SQL tasks are mainly used to update the destination table by joining to the staging tables.
Staging tables contains records which are updated.
5 CDC Control Task: Drag another CDC Control Task to select the CDC Control Operation: Mark Range Processed. This indicates CDC that all these records are now processed and can be excluded from next run.
Once you execute the package, the data is updated on the destination table. The incremental Load package can be run every time we want to grab the latest changes from our Source table. It will store the CDC state every time it runs, ensuring that we only pick up new changes every time we run the package.