Hi,
I recently researched the CHECKSUM and CHECKSUM_AGG functions in T-SQL and found them really useful. However, I was skeptical because these functions can return the same value for non-identical inputs. I went on the forums and found more than one unhappy poster writing about their experience with these functions.
I am designing a large database (a warehouse) and found these functions tempting to implement:
Using CHECKSUM for
- indexing long character fields
- joining on multiple columns of the same table, using a new checksum field in place of the individual columns
Using CHECKSUM_AGG for
- I bulk-copy flat-file source data into a character field of a table, and to ensure that I am not loading the same file multiple times, I plan to use CHECKSUM_AGG( CHECKSUM( [FlatFileRecord] ) ) and verify that no two loads produce the same output.
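The duplicate-load check described above could be sketched roughly like this. This is only an illustration; the table names `dbo.StagingLoad` and `dbo.LoadLog` and their columns are made up for the example:

```sql
-- Hypothetical tables: dbo.StagingLoad holds the freshly loaded file,
-- dbo.LoadLog stores one aggregate checksum per prior load.
DECLARE @NewLoadChecksum INT;

SELECT @NewLoadChecksum = CHECKSUM_AGG( CHECKSUM( [FlatFileRecord] ) )
FROM dbo.StagingLoad;

IF EXISTS (SELECT 1 FROM dbo.LoadLog WHERE LoadChecksum = @NewLoadChecksum)
    -- A match means *possibly* the same file: CHECKSUM_AGG can collide,
    -- so a row-by-row comparison is still needed before rejecting the load.
    PRINT 'Possible duplicate load - verify row by row before rejecting.';
ELSE
    INSERT INTO dbo.LoadLog (LoadChecksum, LoadedAt)
    VALUES (@NewLoadChecksum, GETDATE());
```

Note that a matching aggregate checksum only flags a *candidate* duplicate; it cannot by itself prove the loads are identical.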
Can somebody tell me whether I can trust these methods for my purposes?
Many thanks in advance!!
Thanks,
Harish
You can trust CHECKSUM to be very selective, but you will most likely get collisions from time to time. I would not rely on CHECKSUM alone to "verify that no two loads have the same output"; I would add a comprehensive check for rows with the same CHECKSUM.|||Thanks Kuz for your response. I found CHECKSUM_AGG very efficient in terms of performance: for a 6-million-row table, my expression CHECKSUM_AGG( CHECKSUM( [FlatFileRecord] ) ) executed in just 40 seconds. I intend to store the result in a log; when I load the next file, the same expression will be evaluated on the new data and compared against the previous values in the log. However, it is not yet clear to me whether I can use CHECKSUM for the purposes I listed earlier.
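The "comprehensive check" idea mentioned above is commonly implemented by letting the checksum narrow the search and a full-value comparison resolve collisions. A minimal sketch, assuming a hypothetical table `dbo.LongText` with a long character column `LongCharField`:

```sql
-- Hypothetical setup: a computed checksum column with an index on it.
-- The checksum gives a compact, indexable key over a long field.
ALTER TABLE dbo.LongText
    ADD TextChecksum AS CHECKSUM(LongCharField);
CREATE INDEX IX_LongText_Checksum ON dbo.LongText (TextChecksum);

-- Lookup: the checksum predicate uses the index for a fast seek,
-- and the full-column predicate weeds out any checksum collisions.
SELECT *
FROM dbo.LongText
WHERE TextChecksum = CHECKSUM(@SearchValue)
  AND LongCharField = @SearchValue;
```

The second predicate is what makes the result trustworthy: even if two different strings share a checksum, they will not both satisfy the equality test on the full column.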
Thanks
|||Well, a few months back I found a comment from the Microsoft SQL team; here it is as-is:
"Please don't use CHECKSUM or BINARY_CHECKSUM functions. They are not guaranteed to produce unique values for input. They are simple hash functions used to divide set of values into different ranges (for example to create compact indexes or partition the data). In fact, with the current implementation you can get duplicate checksum values quite easily and there are certain types of input values that will simply produce unexpected results (repeated values, NULLs etc). You could use hashbytes in SQL Server 2005 which can generate MD5 or MD4 hash for example which can avoid collisions but still no guarantee to produce unique value for each input."
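For reference, the HASHBYTES function mentioned in the quote could be used like this in SQL Server 2005 and later (a sketch; the table and column names are made up, and note that in SQL Server 2005 the input to HASHBYTES is limited to 8000 bytes):

```sql
-- HASHBYTES returns a VARBINARY hash; 'MD5' yields 16 bytes.
-- Collisions are far less likely than with CHECKSUM, though no
-- fixed-length hash can guarantee uniqueness for all inputs.
SELECT [FlatFileRecord],
       HASHBYTES('MD5', [FlatFileRecord]) AS RecordHash
FROM dbo.StagingLoad;
```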
Now you have to decide which one to use...|||
Thanks Sekar... Here is my summary: CHECKSUM and CHECKSUM_AGG are deterministic (they always produce the same output for the same input), but they cannot guarantee a unique output for each input. Please correct me if I am wrong.
|||Yes, you got it perfectly.