Why doesn't IEqualityComparer use Equals() in LINQ's Intersect? (with a stroll down the lane of GetHashCode and the Birthday Paradox)

This was my frustrating moment...er...hour of the day:  my discovery that an IEqualityComparer's GetHashCode() method is what actually gets called when comparing two objects in LINQ set operators such as Intersect, Union, and Distinct.  Besides having me running in circles for a while, it makes absolutely no sense to me that these set operators would rely on comparing the hash code of two objects rather than the returned value of their Equals method.  Grrr.

Here are some related posts which describe the problem in more detail and echo my frustration...wish I would've found these before I found the problem!

The discussion surrounding Equals/GetHashCode is deep and long, and surprisingly difficult to implement correctly.  That's why it's equally important to be used correctly...including by the .NET framework itself.  (For an example of an almost right, but still incorrect implementation, please see http://devlicio.us/blogs/billy_mccafferty/archive/2007/04/25/using-equals-gethashcode-effectively.aspx.  A link to a much righter implementation is found at the top of the post.  ;)

The crux of the problem behind using GetHashCode for equality is that GetHashCode doesn't guarantee uniqueness.  So with large sets of data, operations such as Intersection, Union and Distinct may return incorrect results.  At first glance, and in line with my initial assumption, the odds of this happening seem astonishingly small because GetHashCode can return close to 4.3 billion unique values.  Incredibly interesting (to me anyway), as raised by Dror Speiser and Frank Laub on the S#arp Architecture forums a while back, due to the birthday paradox, you only actually need 77,163 objects before there's a 50% probability of a collision.  Want a 75% collision?  Try just 109,125 objects!  These kinds of numbers aren't too uncommon in many applications.

So if you're using one of the LINQ set operators and providing your own IEqualityComparer, take note that it's the objects' GetHashCode() method that's determining uniqueness, not Equals...and the implications this has on dealing with large sets of data.

Billy McCafferty
http://www.itsamuraischool.com


Posted 05-22-2009 4:27 PM by Billy McCafferty
Filed under:

[Advertisement]

Comments

Ayende Rahien wrote re: Why doesn't IEqualityComparer use Equals() in LINQ's Intersect? (with a stroll down the lane of GetHashCode and the Birthday Paradox)
on 05-23-2009 5:51 AM

That should like it is a bug in how they are being put out

Ken Tong wrote re: Why doesn't IEqualityComparer use Equals() in LINQ's Intersect? (with a stroll down the lane of GetHashCode and the Birthday Paradox)
on 05-24-2009 1:34 AM

I tested but found that Intersect() first test for hash equality before calling Equal().

This is a reasonable implementation. What's your opinion?

Peter Ritchie wrote re: Why doesn't IEqualityComparer use Equals() in LINQ's Intersect? (with a stroll down the lane of GetHashCode and the Birthday Paradox)
on 05-25-2009 3:52 PM

The title of this post is very misleading.  It suggests that Equals never gets called when using Enumerable.Intersect(IEnumerable<T>, IEnumerable<T>, IEqualityComparer<T>).

My tests show that it does.  It calls IEqualityComparer.GetHashCode first, then T.GetHashCode() then eventually calls IEqualityComparer.Equals.

Joshua Morgan wrote re: Why doesn't IEqualityComparer use Equals() in LINQ's Intersect? (with a stroll down the lane of GetHashCode and the Birthday Paradox)
on 05-28-2009 12:25 AM

As the post above pointed out equals does eventual get called assuming you have correctly implemented GetHashCode.  i.e. so that the has codes for two equal objects are the same.  So you shouldn't get any false positives even with very large sets.  Presumably GetHashCode gets called because the set extension methods use a dictionary.  If you are using a custom equals method you have to create a custom GetHashCode that 'matches' for dictionary type collections to work properly.

Tobias Manthey wrote re: Why doesn't IEqualityComparer use Equals() in LINQ's Intersect? (with a stroll down the lane of GetHashCode and the Birthday Paradox)
on 07-31-2009 5:56 AM

Hmm "take note that it's the objects' GetHashCode() method that's determining uniqueness, not Equals" is not true in my point of view.

GetHashCode() is called first and only if 2 items (x,y)  return the same HashCode then Equals() is called. GetHashCode gives the comparer an indication whether its worth really comparing 2 objects by calling the Equals() method.

This makes a lot sense for large objects. Guess you want compare a hundred strings each 10 Megs.

Comparing each string by comparing their content (Equals) would be extremly slow.

Rick O'Shay wrote re: Why doesn't IEqualityComparer use Equals() in LINQ's Intersect? (with a stroll down the lane of GetHashCode and the Birthday Paradox)
on 03-21-2010 6:37 PM

So all of that blah, blah, blah for a wrong assertion.  Equals must be consistent with GetHashCode so the optimal approach is to call GetHashCode first.  All the blather about the odds of a duplicate is irrelevant, too. Nobody assumes GetHashCode is unique. In fact, if you like you can return 666 on every object, every time. That would be very, very silly (you've lost the speed benefits) but it's fully functional.

devlicio.us wrote re: Why doesn't IEqualityComparer use Equals() in LINQ's Intersect? (with a stroll down the lane of GetHashCode and the Birthday Paradox)
on 04-29-2011 11:16 AM

Why doesn t iequalitycomparer use equals in linq s intersect including a stroll down the lane of gethashcode and the birthday paradox.. Corking :)

About The CodeBetter.Com Blog Network
CodeBetter.Com FAQ

Our Mission

Advertisers should contact Brendan

Subscribe
Google Reader or Homepage

del.icio.us CodeBetter.com Latest Items
Add to My Yahoo!
Subscribe with Bloglines
Subscribe in NewsGator Online
Subscribe with myFeedster
Add to My AOL
Furl CodeBetter.com Latest Items
Subscribe in Rojo

Member Projects
DimeCasts.Net - Derik Whittaker

Friends of Devlicio.us
Red-Gate Tools For SQL and .NET

NDepend

SlickEdit
 
SmartInspect .NET Logging
NGEDIT: ViEmu and Codekana
LiteAccounting.Com
DevExpress
Fixx
NHibernate Profiler
Unfuddle
Balsamiq Mockups
Scrumy
JetBrains - ReSharper
Umbraco
NServiceBus
RavenDb
Web Sequence Diagrams
Ducksboard<-- NEW Friend!

 



Site Copyright © 2007 CodeBetter.Com
Content Copyright Individual Bloggers

 

Community Server (Commercial Edition)