This was my frustrating moment...er...hour of the day: my discovery that an IEqualityComparer's GetHashCode() method is what actually gets called when comparing two objects in LINQ set operators such as Intersect, Union, and Distinct. Besides having me running in circles for a while, it makes absolutely no sense to me that these set operators would rely on comparing the hash code of two objects rather than the returned value of their Equals method. Grrr.
Here are some related posts which describe the problem in more detail and echo my frustration...wish I would've found these before I found the problem!
The discussion surrounding Equals/GetHashCode is deep and long, and surprisingly difficult to implement correctly. That's why it's equally important to be used correctly...including by the .NET framework itself. (For an example of an almost right, but still incorrect implementation, please see http://devlicio.us/blogs/billy_mccafferty/archive/2007/04/25/using-equals-gethashcode-effectively.aspx. A link to a much righter implementation is found at the top of the post. ;)
The crux of the problem behind using GetHashCode for equality is that GetHashCode doesn't guarantee uniqueness. So with large sets of data, operations such as Intersection, Union and Distinct may return incorrect results. At first glance, and in line with my initial assumption, the odds of this happening seem astonishingly small because GetHashCode can return close to 4.3 billion unique values. Incredibly interesting (to me anyway), as raised by Dror Speiser and Frank Laub on the S#arp Architecture forums a while back, due to the birthday paradox, you only actually need 77,163 objects before there's a 50% probability of a collision. Want a 75% collision? Try just 109,125 objects! These kinds of numbers aren't too uncommon in many applications.
So if you're using one of the LINQ set operators and providing your own IEqualityComparer, take note that it's the objects' GetHashCode() method that's determining uniqueness, not Equals...and the implications this has on dealing with large sets of data.
05-22-2009 4:27 PM