Using checksums for change data capture...
Hi,
It's been recommended in class and shown in some of the textbooks that you can use checksums or CRCs for change data capture when you don't have reliable update date columns in your source system. The problem is that checksums (SQL Server has a CHECKSUM() function, and Oracle has ORA_HASH) aren't unique enough. There are times, especially with large amounts of data, when the checksums for two different rows will turn out to be the same. If you are staging your data in SQL Server, one solution is to use the HASHBYTES hashing function instead. Here's some code I thought I would share that generates the HASHBYTES call you need for your staging table. Just tell it the name of your staging table and the columns you don't want to include in the hash (the primary business key, for example). I've tested the various algorithms supported by the function and MD5 seems to be the fastest. The script spits out a HASHBYTES function call that you can include as a column in your select list:
- Code:
set QUOTED_IDENTIFIER on;

declare @TableName nvarchar(1000) = 'okc_k_lines_b';
declare @SchemaName nvarchar(max) = 'oracle';
declare @Statement varchar(max) = 'HASHBYTES(''MD5'', ';
-- @ColumnsToExclude must have a space before and after each column name you are excluding
declare @ColumnsToExclude nvarchar(max) = ' id creation_date last_update_date ';
declare @ColumnName nvarchar(1000) = '';

declare FieldList cursor local fast_forward
for
    select COLUMN_NAME
    from INFORMATION_SCHEMA.COLUMNS
    where TABLE_NAME = @TableName
      and TABLE_SCHEMA = @SchemaName
      -- HASHBYTES cannot accept these data types as input
      and DATA_TYPE not in ('image', 'text', 'ntext')
    -- keep the column order stable so the generated hash is repeatable
    order by ORDINAL_POSITION;

begin try
    open FieldList;

    fetch next from FieldList into @ColumnName;

    while @@FETCH_STATUS = 0
    begin
        -- surround the name with spaces so a column like 'date' doesn't match inside 'creation_date'
        if CHARINDEX(' ' + @ColumnName + ' ', @ColumnsToExclude) = 0
            set @Statement = @Statement + 'isnull(convert(nvarchar(max),' + @ColumnName + '),'''') + ';

        fetch next from FieldList into @ColumnName;
    end;

    close FieldList;
    deallocate FieldList;

    -- trim the trailing ' + ' and close the HASHBYTES call
    set @Statement = LEFT(@Statement, len(@Statement) - 2) + ')';
    print @Statement;
end try
begin catch
    declare @ErrorMessage nvarchar(4000),
            @ErrorSeverity int,
            @ErrorState int,
            @CursorState int;

    select @ErrorMessage = ERROR_MESSAGE(),
           @ErrorSeverity = ERROR_SEVERITY(),
           @ErrorState = ERROR_STATE();

    -- clean up the cursor if it is still open or allocated
    set @CursorState = CURSOR_STATUS('local', 'FieldList');

    if @CursorState > -1
        close FieldList;

    if @CursorState > -3
        deallocate FieldList;
end catch;
Here's the output:
- Code:
HASHBYTES('MD5', isnull(convert(nvarchar(max),LINE_NUMBER),'') + isnull(convert(nvarchar(max),CHR_ID),'') + isnull(convert(nvarchar(max),CLE_ID),'') + isnull(convert(nvarchar(max),DNZ_CHR_ID),'') + isnull(convert(nvarchar(max),STS_CODE),'') + isnull(convert(nvarchar(max),PRICE_UNIT),'') + isnull(convert(nvarchar(max),CURRENCY_CODE),'') + isnull(convert(nvarchar(max),DATE_TERMINATED),'') + isnull(convert(nvarchar(max),START_DATE),'') + isnull(convert(nvarchar(max),END_DATE),'') + isnull(convert(nvarchar(max),ATTRIBUTE1),'') + isnull(convert(nvarchar(max),ATTRIBUTE8),'') + isnull(convert(nvarchar(max),ATTRIBUTE9),'') + isnull(convert(nvarchar(max),ATTRIBUTE10),'') + isnull(convert(nvarchar(max),CUST_ACCT_ID),'') + isnull(convert(nvarchar(max),BILL_TO_SITE_USE_ID),''))
The resulting hash keys you get from this function look something like this:
0x097408728CCFE6F70F2351554901AC02
0x8F985653E4F4F8393DDD9097AF46B64D
0x6AC705D1E493CE5D7ECC2E64CBEE9E61
The return type is varbinary. The biggest limitation to this approach is that HASHBYTES accepts a maximum input of 8,000 bytes, so you might have to split a wide record across two calls (I haven't had to yet). It also cannot take ntext, text, or image as input.
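For completeness, here is a minimal sketch of how the generated hash column might be used for change detection. The staging and target table names (oracle.okc_k_lines_b_stg, dw.okc_k_lines_b), the row_hash column, and the ID business key are placeholders I made up for illustration, not anything produced by the script above:
- Code:
-- Changed rows: same business key, different hash.
-- row_hash is a varbinary(16) column populated with the generated HASHBYTES(...) expression.
select s.ID
from oracle.okc_k_lines_b_stg as s
join dw.okc_k_lines_b as t
    on t.ID = s.ID
where t.row_hash <> s.row_hash;

-- New rows: no match on the business key at all.
select s.ID
from oracle.okc_k_lines_b_stg as s
where not exists (select 1 from dw.okc_k_lines_b as t where t.ID = s.ID);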
leadfoot
Re: Using checksums for change data capture...
There are times, especially if you have large amounts of data, where checksums for two different rows will turn out to be the same.
True, but misleading. When using hashes for change detection it doesn't matter all that much. In a change detection application you are only comparing two rows: the old row and the new row, matched on the natural key. You are not comparing a new row against all rows in the table. It is well understood that it is not advisable to use a 32-bit hash value as a unique key to a table (essentially what the latter scenario describes), because the probability of a collision becomes very high once the population grows beyond a few hundred thousand rows.
However, in a change detection application, a 32-bit hash gives a 1 in 4 billion chance of a false match (i.e. the same hash value for different content). With corporations struggling to reach a six-sigma quality level (roughly 1 error in a million), a method with an error rate of 1 in 4 billion is well within accepted tolerances. If that is not good enough, the only method that truly guarantees perfect results is one that compares column for column. Note that a 16-bit hash (CRC comes in 16- and 32-bit versions) is not acceptable for change detection: a 1 in 64K chance of a false match is too risky for most applications.
To put 1 in 4 billion in perspective, assume 1,000,000 rows change every day. At that rate you would expect one missed change per 4 billion changed rows, which works out to roughly once a decade. It is not something I would lose sleep over.
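As a quick back-of-the-envelope check (illustrative arithmetic only, using the 1,000,000-changed-rows-per-day assumption above):
- Code:
-- Expected missed changes per year with a 32-bit hash:
-- (changed rows per year) / 2^32
select cast(1000000.0 * 365 / 4294967296.0 as decimal(10, 4)) as expected_misses_per_year;
-- Returns roughly 0.085, i.e. about one missed change every 11-12 years.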
Going with a larger hash (such as MD5) is overkill for a change detection application and can severely impact the performance of ETL processes. Users of typical ETL tools define cached lookup structures in the hope of retaining dimension data (surrogate key, natural key, and hash code) in memory to speed processing. Increasing the size of the hash code from 4 to 16 or 20 bytes may force the tool to spill those structures to disk, particularly for very large dimensions (which matter most), slowing the lookup process for no real improvement in reliability. Besides, there have been cases of MD5 hash collisions in large populations, so such errors are not prevented; a larger hash only decreases the likelihood by a few more orders of magnitude.
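To make the memory point concrete, here is a rough sizing sketch. The 50-million-row dimension and the 4-byte surrogate key / 20-byte natural key figures are made-up numbers for illustration, not anything from the post:
- Code:
-- Approximate lookup-cache size for a hypothetical 50M-row dimension:
-- surrogate key (4 bytes) + natural key (20 bytes) + hash code (4 vs 16 bytes)
select
    cast(50000000.0 * (4 + 20 + 4)  / 1024 / 1024 as decimal(10, 0)) as mb_with_32bit_hash,  -- ~1,335 MB
    cast(50000000.0 * (4 + 20 + 16) / 1024 / 1024 as decimal(10, 0)) as mb_with_md5_hash;    -- ~1,907 MB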
Using MD5
Getting a bit lost on when it's appropriate to use MD5 and when it's overkill.
We have some 42 sources of varying quality from many different colleges, universities and state agencies, and they send complete sets, not changed sets.
We had envisioned calculating the MD5 on the incoming stream and comparing it against an MD5 column stored in the target table(s). Worse yet, one of the architects wants to calculate the MD5 on both incoming and target and not store the MD5 at all.
We are only talking 3 to 10 million rows a year. Are you suggesting that we avoid MD5 and maybe use CRC instead? I don't think natural keys alone will get us there.
hortoristic
Re: Using checksums for change data capture...
If what you are trying to do is compare the content of two rows that have been associated through some other means, such as a natural key, a 32-bit hash is more than sufficient. If it is not, in my opinion the only other option is to compare column by column. No matter what hash method you choose, it is a compromise between code simplicity and risk of error. If a site cannot tolerate a 1-in-4-billion risk of error, it cannot tolerate any error, and the only way to absolutely guarantee no error in the comparison is to compare each column individually.
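For reference, a minimal sketch of the column-by-column alternative. The table and column names (stg.dim_student, dw.dim_student, the student_id natural key) are placeholders; EXCEPT is used because it treats NULLs as equal, which saves wrapping every column in ISNULL:
- Code:
-- Rows whose natural key matches but whose remaining columns differ,
-- detected by comparing every column directly (no hash involved).
select s.student_id
from stg.dim_student as s
join dw.dim_student as t
    on t.student_id = s.student_id
where exists (
        select s.first_name, s.last_name, s.enrollment_status, s.college_code
        except
        select t.first_name, t.last_name, t.enrollment_status, t.college_code
    );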