Thursday, March 8, 2012

cheksum function bug

Hi
i came accross of the problem with checksum function which we use
intensively for large db searches.
According to manul it should always produce different value for different
string value and unique for the same strings.
Here is an example when different strings produce the same checksum value:
select checksum('AJUBELTKAJUBEL')
select checksum('AJUBSBTKAJUBSB')
select checksum('AJUCGBTKAJUCGB')
--
-2013265910Gene,
Although CHECKSUM should always produce the same value for the same string,
it absolutely does not always produce a different value for different
strings. The Books Online do not promise this, but say, "we do not
recommend using CHECKSUM to detect whether values have changed, unless your
application can tolerate occasionally missing a change."
When you think about it, CHECKSUM returns an integer value, with only
4,294,967,296 possible values. However, a 10-character string using the 10
digits and 26 letters of the alphabet has 3,656,158,440,062,976 possible
values. The likelihood of collision is increased, because the algorithm
uses exclusive-ors in computing the has value.
RLF
"Gene." <Gene@.discussions.microsoft.com> wrote in message
news:61D1F361-C95A-4F5C-B29A-249EFE0412D8@.microsoft.com...
> Hi
> i came accross of the problem with checksum function which we use
> intensively for large db searches.
> According to manul it should always produce different value for different
> string value and unique for the same strings.
> Here is an example when different strings produce the same checksum value:
> select checksum('AJUBELTKAJUBEL')
> select checksum('AJUBSBTKAJUBSB')
> select checksum('AJUCGBTKAJUCGB')
> --
> -2013265910
>|||Thank you Russell again.
I was under impression that different combination of letters should always
produce different integer.
Regards, Gene.
"Russell Fields" wrote:
> Gene,
> Although CHECKSUM should always produce the same value for the same string,
> it absolutely does not always produce a different value for different
> strings. The Books Online do not promise this, but say, "we do not
> recommend using CHECKSUM to detect whether values have changed, unless your
> application can tolerate occasionally missing a change."
> When you think about it, CHECKSUM returns an integer value, with only
> 4,294,967,296 possible values. However, a 10-character string using the 10
> digits and 26 letters of the alphabet has 3,656,158,440,062,976 possible
> values. The likelihood of collision is increased, because the algorithm
> uses exclusive-ors in computing the has value.
> RLF
> "Gene." <Gene@.discussions.microsoft.com> wrote in message
> news:61D1F361-C95A-4F5C-B29A-249EFE0412D8@.microsoft.com...
> > Hi
> >
> > i came accross of the problem with checksum function which we use
> > intensively for large db searches.
> > According to manul it should always produce different value for different
> > string value and unique for the same strings.
> >
> > Here is an example when different strings produce the same checksum value:
> >
> > select checksum('AJUBELTKAJUBEL')
> > select checksum('AJUBSBTKAJUBSB')
> > select checksum('AJUCGBTKAJUCGB')
> > --
> >
> > -2013265910
> >
>
>|||Gene,
Just getting back to you. If you are saving the checkdigit for a string in
a column that is a good way to speed up some searches. In fact, the
CHECKDIGIT was designed as a hash function for just that reason. However,
to make sure you have a hit, you also have to use the actual value. E.g.
DECLARE @.LongString NVARCHAR (255)
SET @.LongString = 'We do not recommend using CHECKSUM to detect whether
values have changed, unless your
application can tolerate occasionally missing a change.'
CREATE TABLE LongStrings
(StringKey INT,
SearchableText NVARCHAR(255),
TextHash INT)
CREATE INDEX TextHash ON LongStrings (TextHash)
INSERT INTO LongStrings VALUES (1, @.LongString, CHECKSUM (@.LongString))
-- And insert many many more rows in real life.
SELECT * FROM LongStrings
WHERE TextHash = CHECKSUM (@.LongString)
AND SearchableText = @.LongString
You will notice that only the hash has an index. This reduces storage for
the LongStrings table since it does not have the big index on
SearchableText, but it does require some extra I/O to compare the
SearchableText for rows where TextHash is equal to the queried value.
RLF
"Gene." <Gene@.discussions.microsoft.com> wrote in message
news:D1BFCA68-84CA-4DFB-8F20-DD70E3C2791D@.microsoft.com...
> Thank you Russell again.
> I was under impression that different combination of letters should always
> produce different integer.
> Regards, Gene.
> "Russell Fields" wrote:
>> Gene,
>> Although CHECKSUM should always produce the same value for the same
>> string,
>> it absolutely does not always produce a different value for different
>> strings. The Books Online do not promise this, but say, "we do not
>> recommend using CHECKSUM to detect whether values have changed, unless
>> your
>> application can tolerate occasionally missing a change."
>> When you think about it, CHECKSUM returns an integer value, with only
>> 4,294,967,296 possible values. However, a 10-character string using the
>> 10
>> digits and 26 letters of the alphabet has 3,656,158,440,062,976 possible
>> values. The likelihood of collision is increased, because the algorithm
>> uses exclusive-ors in computing the has value.
>> RLF
>> "Gene." <Gene@.discussions.microsoft.com> wrote in message
>> news:61D1F361-C95A-4F5C-B29A-249EFE0412D8@.microsoft.com...
>> > Hi
>> >
>> > i came accross of the problem with checksum function which we use
>> > intensively for large db searches.
>> > According to manul it should always produce different value for
>> > different
>> > string value and unique for the same strings.
>> >
>> > Here is an example when different strings produce the same checksum
>> > value:
>> >
>> > select checksum('AJUBELTKAJUBEL')
>> > select checksum('AJUBSBTKAJUBSB')
>> > select checksum('AJUCGBTKAJUCGB')
>> > --
>> >
>> > -2013265910
>> >
>>|||Grrr, I see that I used CHECKDIGIT in the first paragraph. That should be,
of course, CHECKSUM. - RLF
"Russell Fields" <russellfields@.nomail.com> wrote in message
news:%23J0Q6OYEIHA.3980@.TK2MSFTNGP03.phx.gbl...
> Gene,
> Just getting back to you. If you are saving the checkdigit for a string
> in a column that is a good way to speed up some searches. In fact, the
> CHECKDIGIT was designed as a hash function for just that reason. However,
> to make sure you have a hit, you also have to use the actual value. E.g.
> DECLARE @.LongString NVARCHAR (255)
> SET @.LongString = 'We do not recommend using CHECKSUM to detect whether
> values have changed, unless your
> application can tolerate occasionally missing a change.'
> CREATE TABLE LongStrings
> (StringKey INT,
> SearchableText NVARCHAR(255),
> TextHash INT)
> CREATE INDEX TextHash ON LongStrings (TextHash)
> INSERT INTO LongStrings VALUES (1, @.LongString, CHECKSUM (@.LongString))
> -- And insert many many more rows in real life.
> SELECT * FROM LongStrings
> WHERE TextHash = CHECKSUM (@.LongString)
> AND SearchableText = @.LongString
> You will notice that only the hash has an index. This reduces storage for
> the LongStrings table since it does not have the big index on
> SearchableText, but it does require some extra I/O to compare the
> SearchableText for rows where TextHash is equal to the queried value.
>
> RLF
> "Gene." <Gene@.discussions.microsoft.com> wrote in message
> news:D1BFCA68-84CA-4DFB-8F20-DD70E3C2791D@.microsoft.com...
>> Thank you Russell again.
>> I was under impression that different combination of letters should
>> always
>> produce different integer.
>> Regards, Gene.
>> "Russell Fields" wrote:
>> Gene,
>> Although CHECKSUM should always produce the same value for the same
>> string,
>> it absolutely does not always produce a different value for different
>> strings. The Books Online do not promise this, but say, "we do not
>> recommend using CHECKSUM to detect whether values have changed, unless
>> your
>> application can tolerate occasionally missing a change."
>> When you think about it, CHECKSUM returns an integer value, with only
>> 4,294,967,296 possible values. However, a 10-character string using the
>> 10
>> digits and 26 letters of the alphabet has 3,656,158,440,062,976 possible
>> values. The likelihood of collision is increased, because the algorithm
>> uses exclusive-ors in computing the has value.
>> RLF
>> "Gene." <Gene@.discussions.microsoft.com> wrote in message
>> news:61D1F361-C95A-4F5C-B29A-249EFE0412D8@.microsoft.com...
>> > Hi
>> >
>> > i came accross of the problem with checksum function which we use
>> > intensively for large db searches.
>> > According to manul it should always produce different value for
>> > different
>> > string value and unique for the same strings.
>> >
>> > Here is an example when different strings produce the same checksum
>> > value:
>> >
>> > select checksum('AJUBELTKAJUBEL')
>> > select checksum('AJUBSBTKAJUBSB')
>> > select checksum('AJUCGBTKAJUCGB')
>> > --
>> >
>> > -2013265910
>> >
>>
>

No comments:

Post a Comment