Delta CDC: How it Compares to SQL CDC
What is Meant by CDC?
Change Data Capture (CDC) is a mechanism to record the effect of
a transaction on a table within a database or lake house. It records the details
of insertions, updates, and deletions to data. It is separate from the table
data and has its own lifecycle. It is expected to contain data and metadata for
these actions.
CDC is available for SQL Server and Azure SQL. It is also available for
Delta Tables within Lake Houses.
There are differences between these two versions of CDC. We will look at the SQL CDC and what, if any comparison feature exists for Delta Tables.
Compare CDC
Feature |
SQL |
Delta |
Enable
for the database |
This
permits the CDC to be enabled on a database table. It does not enable this for
any table. |
It can be
set to default for all tables in a database as they are created either during
a session or in a Spark Pool’s Configuration. |
Enable
for the table |
Requires
that the schema and table are set. Optionally the security role required to
access the Change Feed can be set. |
Schemas
are not present in Lake houses. The table is set against the database. No gating
is available for the Change Feed. |
Limit
Columns captured |
Optionally
you can specify the list of columns you wish to capture. Other columns will
be excluded from the Change Feed |
The
option is on or off. Its not possible to filter the list of columns in the
change feed. |
Retention
period. |
The
retention period is specified by the in-built stored procedure [sp_cdc_change_job]. It can
be disabled entirely. It can
be given a time frame in minutes. |
The
Change feed for data that is subject to vacuum becomes unreadable. It is
not possible to independently clean up CDC from data transactions. |
Transaction
Management |
Has two
values for reviewing the change record order. __$seqval
for record order in a transaction.
|
Delta
has two values _commit_version and _commit_timestamp that detail the log
version and time it was written. These are essentially interchangeable and represent
a single transaction. It is not possible to return changes in a deterministic
order inside a transaction. |
Change
Type |
A
numeric value identifies the nature of the change. Delete, Insert Update post
or pre-image. |
A text
value identifies the nature of the change. Delete, Insert Update post or pre-image. |
Identify
the fields that have been changed. |
__$update_mask
is a binary that contains a detail of each column in order and whether it has
been changed |
There
is no analog for update mask for Delta Table. |
Elastic
Schema |
New
columns will not be added to CDC records. |
New
Columns added will be added to new CDC records. NULL values will be returned
for any record before the change. |
Dynamic
Masking |
Masked values
are replicated in the CDC records |
Dynamic
Masking is not available. |
Examples
SQL
Create a Table and Database as
required
USE master GO -- Drop the database if it already exists IF EXISTS
( SELECT
name FROM
sys.databases WHERE
name = N'Sandbox' ) DROP DATABASE
Sandbox GO CREATE DATABASE
Sandbox GO USE [Sandbox] GO IF OBJECT_ID('dbo.Table_1',
'U') IS
NOT NULL DROP
TABLE dbo.Table_1 GO CREATE TABLE
[dbo].[Table_1]( [Id] [int] IDENTITY(1,1)
NOT NULL, [Text1] [nchar](10) NULL, [number1] Int NULL, [datetime1] DateTime NULL ) ON
[PRIMARY] GO |
Set up the features
required for CDC.
- Enable CDC on the database
- Enable CDC on the table
USE Sandbox /*Gotcha Could not update the metadata that indicates
database Sandbox is enabled for Change Data Capture. The failure occurred
when executing the command 'SetCDCTracked(Value = 1)'. The error returned was
15404: 'Could not obtain information about Windows NT group/user 'XXXXXXX\xxxxx',
error code 0x54b.'. Use the action and error to determine the cause of the
failure and resubmit the request. In order to fix this the owner of the database
needs to be changed */ EXEC sp_changedbowner 'sa' GO EXECUTE sys.sp_cdc_enable_db; GO USE Sandbox GO EXEC sys.sp_cdc_enable_table @source_schema = N'dbo', @source_name = N'Table_1',
@role_name = N'MyRole' GO -- Set the retention time. This is in minutes sp_cdc_change_job @job_type='cleanup', @retention=5 |
Review the CDC
features
- Databases enabled for CDC
- Tables Enabled for CDC
- Columns Captured
SELECT
name,is_cdc_enabled from
sys.databases select
SCHEMA_NAME([schema_id]),
name,
is_tracked_by_cdc from
sys.tables; EXECUTE
sys.sp_cdc_get_captured_columns @capture_instance ='dbo_Table_1'; |
- Change Data Caputure stats
NB this must be run in a query by
itself
sys.sp_cdc_help_change_data_capture |
Disable CDC on a table
EXEC sys.sp_cdc_disable_table @source_schema = N'dbo', @source_name = N'Table_1', @capture_instance = 'dbo_Table_1' |
Enable CDC with limited Columns
EXEC sys.sp_cdc_enable_table @source_schema =
N'dbo', @source_name
= N'Table_1', @role_name
=
N'MyRole', @captured_column_list = N'Text1, number1, datetime1' |
Check Columns again
EXECUTE sys.sp_cdc_get_captured_columns
@capture_instance ='dbo_Table_1'; |
Inserts
INSERT INTO
[dbo].[Table_1] ([Text1]
,[number1]
,[datetime1]) VALUES ('value1',1,'2023-11-01'), ('value2',2,'2023-11-02'), ('value3',3,'2023-11-03'), ('value4',4,'2023-11-04'), ('value5',5,'2023-11-05'), ('value6',6,'2023-11-06'), ('value7',7,'2023-11-07'), ('value8',8,'2023-11-08'), ('value9',9,'2023-11-09'), ('value10',10,'2023-11-10') GO |
Update
UPDATE [Table_1] SET [Text1] = 'Change' WHERE [datetime1]= '2023-11-01' SELECT * FROM [dbo].[Table_1] |
Delete
DELETE FROM
[Table_1] WHERE Id = 10 SELECT * FROM [dbo].[Table_1] |
Review CDC records
Two methods
- Using
cdc.fn_cdc_get_all_changes_dbo_Table_1
- Using
cdc.dbo_Table_1_CT
These CDC functions are
created automatically.
Columns in results
Field |
Purpose
and Notes |
__$start_lsn |
commit
sequence all rows in a batch will share this value |
__$seqval |
Can be used
to order the rows which occurred in the same transaction |
__$operation |
1 = delete,
2 = insert, 3 = update (before image), and 4 = update (after image) |
__$update_mask |
Is a
variable bit mask with one defined bit for each captured column. For insert
and, delete entries, the update mask has all bits set. Update rows, however,
will have those bits set that corresponds to changed columns. (https://dba.stackexchange.com/questions/35936/change-data-capture-and-the-update-mask-binary) |
-- method 1 DECLARE @from_lsn binary (10), @to_lsn binary (10) SET @from_lsn = sys.fn_cdc_get_min_lsn('dbo_Table_1') SET @to_lsn = sys.fn_cdc_get_max_lsn() SELECT * FROM
cdc.fn_cdc_get_all_changes_dbo_Table_1(@from_lsn, @to_lsn, 'all') ORDER BY
__$seqval --method 2 SELECT * from cdc.dbo_Table_1_CT |
Turn Off CDC
EXECUTE sys.sp_cdc_disable_db; GO |
Delta Table
Create a Table and Database as required
%%sql CREATE TABLE
test1.Table_1 ( Id INT, Text1 STRING, number1 INT, datetime1 TIMESTAMP ) USING
DELTA; |
Import required Libraries
import delta from
pyspark.sql.functions import
col |
Define a function to change table CDC properties
def
def setCDC(switch:
bool, database: str): spark.sql(f"ALTER TABLE {database}.{table.name} SET TBLPROPERTIES (delta.enableChangeDataFeed = {switch})") |
Enable CDC
setCDC( True,"test1.Table_1") |
Insert
%%sql INSERT INTO
test1.Table_1 VALUES (1,'value1',1,'2023-11-01'), (2,'value2',2,'2023-11-02'), (3,'value3',3,'2023-11-03'), (4,'value4',4,'2023-11-04'), (5,'value5',5,'2023-11-05'), (6,'value6',6,'2023-11-06'), (7,'value7',7,'2023-11-07'), (8,'value8',8,'2023-11-08'), (9,'value9',9,'2023-11-09'), (10,'value10',10,'2023-11-10') GO |
Update
%%sql UPDATE test1.Table_1 SET Text1 = 'Change' WHERE ID = 1; SELECT * From test1.Table_1 |
Delete
%%sql DELETE FROM
test1.table_1 WHERE id =10; SELECT * From test1.Table_1 |
Review CDC
changes_df = spark.read.format("delta").option("readChangeData", True).option("startingVersion", 1).table('test1.Table_1') display(changes_df) |
Conclusion
Both Change Data Capture methods offer the ability to manage
the change in a transactional table. There are many patterns which flows from
being able to capture this change e.g., loading data into a data warehouse where
only new or changed data since the last load is required.
Considerations
SQL Transactions can contain an INSERT and a subsequent UPDATE in the
same transaction. It is also possible to have a Foreign Key relationship to itself.
So it could be important to process records within the same transaction sequentially.
As such __$start_lsn for the transaction order and __$seqval for record order
in a transaction are essential to allowing this.
Delta Table does not allow transactions in the same fashion as SQL
therefore an inserted record would not be able to be updated until the next
transaction. As such _commit_version is sufficient.
Security
While the gatekeeping
role is intended to exclude users from accessing the change capture data it does
not appear work very well and I’ve struggled to get it working in Azure SQL and
on Premises SQL. I have also failed to find a working example in searches and
Microsoft’s documentation. That said Deny on the CDC schema does prevent access
to the table containing the CDC data.
The Dynamic
masking available for SQL columns is present in the CDC data making securing
PII possible.
Delta’s
security model is based on lake access. It does not appear particularly easy to
separate access to the change feed data. Presumably using ACL to limit access
to the _change_data directory would prevent reading this data. I have however
not tested this.
Utility
Delta Lake would
not represent the source for applications but fits more neatly in a data engineering
setting. SQL is often used as the source for OLTP systems. Where CDC is available
on a SQL source it is beneficial to use this rather than the raw data as the
source not least the separation of Change Data Capture records from the table
data prevents issues of reading from the source.
Delta Change
Data Capture can be utilized where either CDC data is not present in source
data or for sources where this may not be present.
A pattern
of using merge and change data capture would streamline on processing of data.
In the above pattern select those records that are inserts or where the
pre-image varies from the post-image to allow a proper change feed.
Comments
Post a Comment