Data lookup
In my raw data I have 380+ million records, and the primary key consists of 4 alphanumeric columns, each 32 bits wide. I receive about 2 million new records every day, and I need to check whether any of the new records already exist among the 380 million, within a very short processing window. One solution I considered is to generate a CRC on the 4 primary key columns and use the new CRC column as the lookup, but I understand that Informatica's CRC32 can generate the same CRC for different key combinations. Any thoughts or suggestions would be greatly appreciated. Thank you in advance.
Upen
Re: Data lookup
CRC32 is an industry-standard algorithm that produces a 32-bit hash value of a string. Because it is 32 bits, there are only a little over 4 billion possible values. Given the intended application, there will be collisions (duplicate values for different strings), and far more frequently than the raw numbers might suggest.
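To put a number on "far more frequently": a quick birthday-problem estimate, sketched in Python (the record count comes from the question above, and the arithmetic assumes CRC values are roughly uniformly distributed):

```python
# Back-of-envelope collision estimate for a 32-bit hash.
n = 380_000_000      # existing records (from the question)
buckets = 2 ** 32    # possible CRC32 values, a little over 4 billion

# Expected number of colliding pairs is C(n, 2) / buckets.
expected_collisions = n * (n - 1) / 2 / buckets
print(f"{expected_collisions:,.0f}")   # roughly 16.8 million colliding pairs
```

So at 380M rows you are not gambling on whether collisions happen; millions of them are expected.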
However, in your situation, using a 32-bit CRC is a good way to cut down on the size of your index and possibly speed up lookups. Define an index on the CRC and allow duplicates. Queries should select based on the CRC plus the columns that make up the natural key, but do not include the natural key columns in the index; that just bloats the index and slows things down. The CRC will get you down to a handful of rows, probably 4 at most, which can then be searched sequentially.
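A minimal sketch of that layout, using SQLite and Python's zlib.crc32 as stand-ins (the table and column names here are invented for illustration; any database with non-unique indexes works the same way):

```python
import sqlite3
import zlib

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE fact (
        k1 TEXT, k2 TEXT, k3 TEXT, k4 TEXT,  -- the 4-column natural key
        crc INTEGER                           -- CRC32 of the concatenated key
    )
""")
# Non-unique index on the CRC alone; the natural key columns stay out of it.
con.execute("CREATE INDEX ix_fact_crc ON fact (crc)")

def crc_of(k1, k2, k3, k4):
    # Hash the concatenated key; the separator avoids ("ab","c") == ("a","bc").
    return zlib.crc32("|".join((k1, k2, k3, k4)).encode())

def exists(k1, k2, k3, k4):
    # Seek on the CRC first, then verify the full natural key against the
    # handful of rows (collisions included) that share that CRC.
    row = con.execute(
        "SELECT 1 FROM fact "
        "WHERE crc = ? AND k1 = ? AND k2 = ? AND k3 = ? AND k4 = ?",
        (crc_of(k1, k2, k3, k4), k1, k2, k3, k4),
    ).fetchone()
    return row is not None
```

The point of the pattern is that the small index does the seeking and the natural key columns only act as a final filter on a few candidate rows.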
But if your intent was to use a CRC as a replacement for the natural key columns and use it in a lookup cache within INFA, it won't work. It is a statistical certainty that you will have duplicate CRC values (in fact, duplicates are likely in much smaller populations). The only hash option that makes a duplicate highly unlikely is a large hash. There are algorithms that produce 160-bit hash values, which would do the trick, but at 20 bytes each, that is probably too large to make caching 380M+ values practical in INFA.
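For a sense of scale, a quick sizing sketch using Python's hashlib (SHA-1 is one such 160-bit algorithm; these figures are back-of-envelope for the raw hash values only, not a measurement of an INFA cache):

```python
import hashlib

# A 160-bit digest is 20 bytes.
digest = hashlib.sha1(b"key1|key2|key3|key4").digest()
print(len(digest))                               # 20

rows = 380_000_000
print(f"{rows * len(digest) / 2**30:.1f} GiB")   # ~7.1 GiB of hash values alone
```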