Delta CDC: How it Compares to SQL CDC

What is Meant by CDC?

Change Data Capture (CDC) is a mechanism to record the effect of a transaction on a table within a database or lake house. It records the details of insertions, updates, and deletions to data. It is separate from the table data and has its own lifecycle. It is expected to contain data and metadata for these actions.

CDC is available for SQL Server and Azure SQL. It is also available for Delta Tables within Lake Houses.

There are differences between these two versions of CDC. We will look at the SQL CDC and what, if any comparison feature exists for Delta Tables.

Compare CDC

Feature

SQL

Delta

Enable for the database

This permits the CDC to be enabled on a database table. It does not enable this for any table.

It can be set to default for all tables in a database as they are created either during a session or in a Spark Pool’s Configuration.

Enable for the table

Requires that the schema and table are set. Optionally the security role required to access the Change Feed can be set.

Schemas are not present in Lake houses. The table is set against the database. No gating is available for the Change Feed.

Limit Columns captured

Optionally you can specify the list of columns you wish to capture. Other columns will be excluded from the Change Feed

The option is on or off. Its not possible to filter the list of columns in the change feed.

Retention period.

The retention period is specified by the in-built stored procedure [sp_cdc_change_job].

It can be disabled entirely.

It can be given a time frame in minutes.

The Change feed for data that is subject to vacuum becomes unreadable.

It is not possible to independently clean up CDC from data transactions.

Transaction Management

Has two values for reviewing the change record order.
__$start_lsn for the transaction order and

__$seqval for record order in a transaction.

 

Delta has two values _commit_version and _commit_timestamp that detail the log version and time it was written. These are essentially interchangeable and represent a single transaction. It is not possible to return changes in a deterministic order inside a transaction.

Change Type

A numeric value identifies the nature of the change. Delete, Insert Update post or pre-image.

A text value identifies the nature of the change. Delete, Insert Update post or pre-image.

Identify the fields that have been changed.

__$update_mask is a binary that contains a detail of each column in order and whether it has been changed

There is no analog for update mask for Delta Table.

Elastic Schema

New columns will not be added to CDC records.

New Columns added will be added to new CDC records. NULL values will be returned for any record before the change.

Dynamic Masking

Masked values are replicated in the CDC records

Dynamic Masking is not available.



Examples

SQL

Create a Table and Database as required

USE master

GO

 

-- Drop the database if it already exists

IF  EXISTS (

    SELECT name

        FROM sys.databases

        WHERE name = N'Sandbox'

)

DROP DATABASE Sandbox

GO

 

CREATE DATABASE Sandbox

GO

 

USE [Sandbox]

GO

 

IF OBJECT_ID('dbo.Table_1', 'U') IS NOT NULL

  DROP TABLE dbo.Table_1

GO

CREATE TABLE [dbo].[Table_1](

    [Id] [int] IDENTITY(1,1) NOT NULL,

    [Text1] [nchar](10) NULL,

    [number1] Int NULL,

    [datetime1] DateTime NULL

) ON [PRIMARY]

GO

Set up the features required for CDC.

  • Enable CDC on the database
  • Enable CDC on the table

USE Sandbox

 

/*Gotcha

Could not update the metadata that indicates database Sandbox is enabled for Change Data Capture. The failure occurred when executing the command 'SetCDCTracked(Value = 1)'. The error returned was 15404: 'Could not obtain information about Windows NT group/user 'XXXXXXX\xxxxx', error code 0x54b.'. Use the action and error to determine the cause of the failure and resubmit the request.

 

In order to fix this the owner of the database needs to be changed

*/

EXEC sp_changedbowner 'sa'

GO

 

EXECUTE sys.sp_cdc_enable_db;

GO

 

USE Sandbox

GO

    EXEC sys.sp_cdc_enable_table

    @source_schema = N'dbo',

    @source_name   = N'Table_1',  

    @role_name     = N'MyRole'

   

GO

 

-- Set the retention time. This is in minutes

sp_cdc_change_job @job_type='cleanup', @retention=5

Review the CDC features

  • Databases enabled for CDC
  • Tables Enabled for CDC
  • Columns Captured

SELECT name,is_cdc_enabled

from sys.databases

 

select

    SCHEMA_NAME([schema_id]),

    name,

    is_tracked_by_cdc

from sys.tables;

 

EXECUTE sys.sp_cdc_get_captured_columns @capture_instance ='dbo_Table_1';

 

  • Change Data Caputure stats

NB this must be run in a query by itself

sys.sp_cdc_help_change_data_capture

Disable CDC on a table

    EXEC sys.sp_cdc_disable_table

    @source_schema = N'dbo',

    @source_name   = N'Table_1',

    @capture_instance = 'dbo_Table_1'

Enable CDC with limited Columns

EXEC sys.sp_cdc_enable_table

    @source_schema = N'dbo',

    @source_name   = N'Table_1',  

    @role_name     = N'MyRole',

    @captured_column_list = N'Text1, number1, datetime1'

Check Columns again

EXECUTE sys.sp_cdc_get_captured_columns @capture_instance ='dbo_Table_1';

 

Inserts

INSERT INTO [dbo].[Table_1]

           ([Text1]

           ,[number1]

           ,[datetime1])

     VALUES

      ('value1',1,'2023-11-01'),

('value2',2,'2023-11-02'),

('value3',3,'2023-11-03'),

('value4',4,'2023-11-04'),

('value5',5,'2023-11-05'),

('value6',6,'2023-11-06'),

('value7',7,'2023-11-07'),

('value8',8,'2023-11-08'),

('value9',9,'2023-11-09'),

('value10',10,'2023-11-10')

 

GO

Update

UPDATE [Table_1]

SET [Text1] = 'Change'

WHERE [datetime1]= '2023-11-01'

SELECT * FROM [dbo].[Table_1]

Delete

DELETE FROM [Table_1]

WHERE Id = 10

SELECT * FROM [dbo].[Table_1]

Review CDC records

Two methods

  1. Using cdc.fn_cdc_get_all_changes_dbo_Table_1
  2. Using cdc.dbo_Table_1_CT

These CDC functions are created automatically.

Columns in results

Field

Purpose and Notes

__$start_lsn

commit sequence all rows in a batch will share this value

__$seqval

Can be used to order the rows which occurred in the same transaction

__$operation

1 = delete, 2 = insert, 3 = update (before image), and 4 = update (after image)

__$update_mask

Is a variable bit mask with one defined bit for each captured column. For insert and, delete entries, the update mask has all bits set. Update rows, however, will have those bits set that corresponds to changed columns. (https://dba.stackexchange.com/questions/35936/change-data-capture-and-the-update-mask-binary)

 

-- method 1

DECLARE @from_lsn binary (10), @to_lsn binary (10)

 

SET @from_lsn = sys.fn_cdc_get_min_lsn('dbo_Table_1')

SET @to_lsn = sys.fn_cdc_get_max_lsn()

 

SELECT *

FROM cdc.fn_cdc_get_all_changes_dbo_Table_1(@from_lsn, @to_lsn, 'all')

ORDER BY __$seqval

 

--method 2

SELECT * from cdc.dbo_Table_1_CT

Turn Off CDC

EXECUTE sys.sp_cdc_disable_db;

GO


Delta Table 

Create a Table and Database as required

%%sql

CREATE TABLE test1.Table_1 (

    Id INT,

    Text1 STRING,

    number1 INT,

    datetime1 TIMESTAMP

  ) USING DELTA;

Import required Libraries

import delta

from pyspark.sql.functions import col

Define a function to change table CDC properties

def def setCDC(switch: bool, database: str):

    spark.sql(f"ALTER TABLE {database}.{table.name} SET TBLPROPERTIES (delta.enableChangeDataFeed = {switch})")  

Enable CDC

setCDC( True,"test1.Table_1")

Insert

%%sql

INSERT INTO test1.Table_1

VALUES

      (1,'value1',1,'2023-11-01'),

(2,'value2',2,'2023-11-02'),

(3,'value3',3,'2023-11-03'),

(4,'value4',4,'2023-11-04'),

(5,'value5',5,'2023-11-05'),

(6,'value6',6,'2023-11-06'),

(7,'value7',7,'2023-11-07'),

(8,'value8',8,'2023-11-08'),

(9,'value9',9,'2023-11-09'),

(10,'value10',10,'2023-11-10')

 

GO

Update

%%sql

UPDATE test1.Table_1

SET Text1 = 'Change'

WHERE ID = 1;

SELECT * From test1.Table_1

Delete

%%sql  

DELETE FROM test1.table_1

WHERE id =10;

SELECT * From test1.Table_1

Review CDC

changes_df = spark.read.format("delta").option("readChangeData", True).option("startingVersion", 1).table('test1.Table_1')

display(changes_df)

 

Conclusion

Both Change Data Capture methods offer the ability to manage the change in a transactional table. There are many patterns which flows from being able to capture this change e.g., loading data into a data warehouse where only new or changed data since the last load is required.

Considerations

SQL Transactions can contain an INSERT and a subsequent UPDATE in the same transaction. It is also possible to have a Foreign Key relationship to itself. So it could be important to process records within the same transaction sequentially. As such __$start_lsn for the transaction order and __$seqval for record order in a transaction are essential to allowing this.

Delta Table does not allow transactions in the same fashion as SQL therefore an inserted record would not be able to be updated until the next transaction. As such _commit_version is sufficient.

Security

While the gatekeeping role is intended to exclude users from accessing the change capture data it does not appear work very well and I’ve struggled to get it working in Azure SQL and on Premises SQL. I have also failed to find a working example in searches and Microsoft’s documentation. That said Deny on the CDC schema does prevent access to the table containing the CDC data.

The Dynamic masking available for SQL columns is present in the CDC data making securing PII possible.

Delta’s security model is based on lake access. It does not appear particularly easy to separate access to the change feed data. Presumably using ACL to limit access to the _change_data directory would prevent reading this data. I have however not tested this.

File Explorer list of folders and files representing a Delta Table


Utility

Delta Lake would not represent the source for applications but fits more neatly in a data engineering setting. SQL is often used as the source for OLTP systems. Where CDC is available on a SQL source it is beneficial to use this rather than the raw data as the source not least the separation of Change Data Capture records from the table data prevents issues of reading from the source.

Delta Change Data Capture can be utilized where either CDC data is not present in source data or for sources where this may not be present.

Flow Chart load data -> merge -> extract change log-> filter changes -. Transform

A pattern of using merge and change data capture would streamline on processing of data. In the above pattern select those records that are inserts or where the pre-image varies from the post-image to allow a proper change feed.

Links


Comments

Popular posts from this blog

Semantic Model Building Using Direct Lake

So, You Want to Pass the DP-600 Exam.