Saturday, November 19, 2011

Field Level Auditing using Change Data Capture - Part 1

Ever have someone ask “who changed that data” or “when did that data change” or maybe “I need to see all changes to that data from last week”? There are many cases, from internal monitoring and quality control to external regulations that require basic auditing of data: what changed, who changed it, and when did it change.

Usually triggers or expensive data access comparisons were used to track what was being changed. But now there are some built in database technologies to track these changes for us. SQL Server 2008 introduced two new auditing features called Change Tracking and Change Data Capture. In the simplest terms, Change Tracking tracks changes at the record level, while Change Data Capture tracks changes at the field level. Each has strengths depending on your auditing needs. This article will concentrate on Change Data Capture (CDC) and will give an example of how to enable CDC on a database and table, and how to use some of the CDC stored procedures and functions to review the change data. In the next article, we will mart the data from the change table into a data warehouse using SSIS and then present the data in a simple SSRS report.

I currently work as a contractor for a government agency that deals with conservation contracts for farmers and ranchers. The main web application is used by over 10K users nationwide. Since the application deals with money, we are audited yearly by both internal and external auditors. One question they always ask is who is changing the data. We know who the user is making the changes so we stamp the records with the user id of the user who made the last change. Using triggers, we write a record history of the record level changes, but decided against trying to track field level changes due to performance and resource concerns.  Now with CDC, capturing those field level changes is easy and can be implemented in a way as to not hurt performance on production database systems. Another benefit of capturing the field level changes is with ETL processing of data warehouse and mart data, nightly delta changes can be applied without having to do a full data refresh.

CDC uses the transaction log to capture record changes for the tracked source tables. As changes are written to the transaction log, the capture process reads those changes and populates associated change tables. The capture process uses the same process as replication to read the transaction log, in fact CDC was written by the SQL Server Replication team. CDC is a simple, clean way to capture field level data changes.

Create some test data

Before we start, let’s create some sample data. Create a test database:

USE [master]
GOCREATE DATABASE [CDCDemo] ON  PRIMARY
( NAME = N'CDCDemo', FILENAME = N'C:\CDCDemo.mdf' , SIZE = 2048KB , MAXSIZE = UNLIMITED, FILEGROWTH = 1024KB )
 LOG ON
( NAME = N'CDCDemo_log', FILENAME = N'C:\CDCDemo_log.ldf' , SIZE = 1024KB , MAXSIZE = 2048GB , FILEGROWTH = 10%)
GO
 
Next create a simple table that can be used to track changes. The table simulates a contract and includes the user id field of the person who last updated the record. The user id can be mapped to a master Users table to get user name and other information as needed:

CREATE TABLE [dbo].[Contract](
      [id] [int] NOT NULL,
      [name] [varchar](50) NULL,   
      [year] [int] NULL,
      [length] [int] NULL,
      [type] [int] NULL,
      [amount] [decimal](18, 2) NULL,
      [last_change_id] [int] NULL
 CONSTRAINT [PK_customer_office] PRIMARY KEY NONCLUSTERED
(
      [id] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
 

Enable CDC in the database

Turning on CDC is simple and straight forward.  The first thing to do is enable CDC at the database level. Use the system stored procedure sp_cdc_enable_db to do this:

use [CDCDemo]
go-- enable CDC on the database
EXEC sys.sp_cdc_enable_db
GO
 
This adds several system tables used by CDC to track the tables, columns, indexes, and LSN (Log Sequence Number) mapping. We’ll use these tables later when moving the change data to the data warehouse. Also note that all CDC tables and functions use the cdc schema:


Enable the change tables

Next, specify which tables you want to track using CDC. Use the stored procedure sp_cdc_enable_table to do this.
-- enable CDC on table, creates capture job and cleanup job and cdc functions
EXEC sys.sp_cdc_enable_table
  @source_schema = N'dbo',
  @source_name   = N'Contract',
  @role_name     = NULL, -- gating role if needed
  @filegroup_name = N'PRIMARY' -- should put in separate filegroup
GO
 
The @filegroup_name parameter is used to specify a different file group for the change tables. It is recommended that the changes tables be put in a separate file group to minimize the performance impact of writing data to the change tables on normal database processing.

You can also use the @role_name parameter to specify a database role that has access to the change table data. Otherwise normal SELECT permissions are used to determine if a user can read the data in these tables. By default, all columns in the source table are added to the change table, but you can use the @captured_column_list parameter to subset that list if needed.

Executing this command does a few things. It creates the change table used to store the data by appending “_CT” to the schema and table name. It also creates functions that can be used to view this change data, fn_cdc_get_all_changes_ and fn_cdc_get_net_changes_. Using these functions, the change data can be queried looking at all records or just net changes over a designated time span.




Enabling a table for CDC also creates two jobs, a capture job and a cleanup job:




The capture job monitors the transaction log for entries related to the change tables and moves those changes to the change table. Since CDC uses the same process as replication, if replication is already enabled on the database, a separate capture job will not be created. CDC will just use the existing replication log reader job. The cleanup job will delete data from the change table on a regular schedule to keep it from growing too large.  By default the cleanup job will delete data after three days, but this retention period can be adjusted using the sp_cdc_change_job stored procedure.

Review the changes

Next let’s put some data in the table and see what happens. Insert a couple of rows, delete one of the rows, and update the other:

insert into contract
  values (1, 'One', 2011, 5, 1, 1000.00, 1234)
insert into contract
  values (2, 'Two', 2010, 10, 2, 2000.00, 2345)
delete from contract
 where id = 2
update contract
  set amount = 10000
  last_change_id = 2345
 where id = 1
 
Use the CDC functions to look at the data. These functions use Log Sequence Numbers (LSN) to specify a starting and ending LSN range to select. The system function fn_cdc_get_min_lsn gets the minimum LSN for the change table, the function fn_cdc_get_max_lsn gets the maximum:

DECLARE @from_lsn binary(10), @to_lsn binary(10)
SET @from_lsn = sys.fn_cdc_get_min_lsn('dbo_contract')
SET @to_lsn   = sys.fn_cdc_get_max_lsn()
SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_contract
  (@from_lsn, @to_lsn, N'all update old')
  order by __$seqval
GO



The first four columns are metadata about the change. The first column __$startLSN is the commit log sequence number for the change. This can be converted to the date the change happened using the function fn_cdc_map_lsn_to_time. The column __$seqval is the sequence number and can be used to order multiple changes made within a single transaction or changes across time for a single record. The __$operation field indicates what type of change was made based on these values:
  • 1 – Delete
  • 2 – Insert
  • 3 – Update Before
  • 4 – Update After
Update Before has the original row values before the change and Update After has the new values after the change. The __$update_mask column is a bit flag with one bit for each captured column. The bit will be set if the field changed during the operation. The remaining columns are the captured columns from the source table. You can see in the results that there are two Inserts, one Delete, and an update, before and after, mirroring the changes we made.

The update mask tells which fields were changed with each operation. For insert and delete operations, all bits are set since all fields were changed. For updates, only the fields that changed are set. This is where the true power of CDC comes from. The two functions fn_cdc_get_column_ordinal and fn_cdc_is_bit_set can be used to query this bit mask and determine which columns changed:

DECLARE @from_lsn binary(10), @to_lsn binary(10)
DECLARE @name_ordinal int
DECLARE @year_ordinal int
DECLARE @length_ordinal int
DECLARE @type_ordinal int
DECLARE @amount_ordinal int
DECLARE @last_change_id_ordinal int
SET @from_lsn = sys.fn_cdc_get_min_lsn('dbo_contract')
SET @to_lsn = sys.fn_cdc_get_max_lsn()
SET @name_ordinal = sys.fn_cdc_get_column_ordinal('dbo_contract','name')
SET @year_ordinal = sys.fn_cdc_get_column_ordinal('dbo_contract','year')
SET @length_ordinal = sys.fn_cdc_get_column_ordinal('dbo_contract','length')
SET @type_ordinal = sys.fn_cdc_get_column_ordinal('dbo_contract','type')
SET @amount_ordinal = sys.fn_cdc_get_column_ordinal('dbo_contract','amount')
SET @last_change_id_ordinal = sys.fn_cdc_get_column_ordinal('dbo_contract','last_change_id')
SELECT fn_cdc_get_all_changes_dbo_contract.__$operation,
  fn_cdc_get_all_changes_dbo_contract.__$update_mask,
  fn_cdc_get_all_changes_dbo_contract.id,
  sys.fn_cdc_is_bit_set(@name_ordinal,__$update_mask) as 'IsNameUpdated',
  sys.fn_cdc_is_bit_set(@year_ordinal,__$update_mask) as 'IsYearUpdated',
  sys.fn_cdc_is_bit_set(@length_ordinal,__$update_mask) as 'IsLengthUpdated',
  sys.fn_cdc_is_bit_set(@type_ordinal,__$update_mask) as 'IsTypeUpdated',
  sys.fn_cdc_is_bit_set(@amount_ordinal,__$update_mask) as 'IsAmountUpdated',
  sys.fn_cdc_is_bit_set(@last_change_id_ordinal,__$update_mask) as 'IsLastChangeIdUpdated'
 FROM cdc.fn_cdc_get_all_changes_dbo_contract( @from_lsn, @to_lsn, 'all')
 ORDER BY __$seqval


The results show the IsUpdated fields set if the operation changed the corresponding field. These flags can then be used to generate reports showing what changed at the field level, who made those changes, and when the change happened.

Conclusion

Change Data Capture is a cool new feature introduced in SQL Server 2008. Using this technique, the holy grail of tracking field level changes can be implemented in a clean and efficient manner. In this article we looked at how to enable CDC for a database and table, and reviewed some of the stored procedures and functions that can be used to view the data in a meaningful way. In the next article we will look at marting the change table data using SSIS and create a SSRS report to view those changes.
You can’t always keep people’s hands out of the cookie jar, but with CDC, you at least know whose hands those are.

References

Basics of Change Data Capture (http://msdn.microsoft.com/en-us/library/cc645937.aspx)
Change Data Capture (http://technet.microsoft.com/en-us/library/bb522489.aspx)
There are also templates in SSMS that show the basics for how to use CDC.

Author: David Balthrop