Images in this post missing? We recently lost them in a site migration. We're working to restore these as you read this. Should you need an image in an emergency, please contact us at imagehelp@codebetter.com
Finding the difference between two Arrays, or un-LINQ-ing your code

imageLINQ is great, up to the point when it's not.  Then it's really not great at all.  When we trade simplicity of syntax for performance, we have to keep in mind there may come a point when we have to go back the code and factor out the shiny new toys.

My system is complex, but this task is simple.  I have about 95,000 video files to keep track of in a database.  Try as we might, at some point someone is going to make changes, add or remove files, without going through our software, so a periodic file audit is required.

I have two lists of file names, one is the list as it exists in the database, the other as it exists on disk.  This is greatly simplified, but the initial idea was something like this:

List<String> toAdd = (from f in existingFiles
                      where !indexedFiles.Contains(f)
                      select f).ToList();

List<String> toRemove = (from f in indexedFiles
                         where !existingFiles.Contains(f)
                         select f).ToList();

The power of LINQ, short and sweet.  Sadly, it appears to act like this under the hood:

List<String> toAdd = new List<String>();
foreach (String f in existingFiles)
    if (!indexedFiles.Contains(f))
        toAdd.Add(f);

List<String> toRemove = new List<String>();
foreach (String f in indexedFiles)
    if (!existingFiles.Contains(f))
        toRemove.Add(f);

These statements pegged the CPU for minutes at a time - can you see why?  How many times am I looping through each collection of strings?  Twice? Four times?  No, I'm afraid this is classic "Oh 2 Da N" performance here.  The problems lies in using the Contains() method - this method must search through the collection each time it's called, which is often.  To solve this, we'll have to kick it old school C-style:

indexedFiles.Sort();
existingFiles.Sort();

for (int iE = 0, iI = 0; iE < existingFiles.Count || iI < indexedFiles.Count; ) {
    if (iE >= existingFiles.Count) {
        toRemove.Add(indexedFiles[iI]);
        iI++;
    }
    else if (iI >= indexedFiles.Count) {
        toAdd.Add(existingFiles[iE]);
        iE++;
    }
    else {
        Int32 diff = existingFiles[iE].CompareTo(indexedFiles[iI]);
        if (diff == 0) {
            iE++;
            iI++;
        }
        else if (diff > 0) {
            toRemove.Add(indexedFiles[iI]);
            iI++;
        }
        else if (diff < 0) {
            toAdd.Add(existingFiles[iE]);
            iE++;
        }
    }
}

A quick sort of the lists and we can make some time saving assumptions as well as walk through both lists at the same time (.Net had no trouble sorting lists of 95,000 strings in record breaking time).  We start off comparing the strings at index 0 of each list (the first if and else if will be false - I'll come back to those shortly).  If they match, then the file both exists and is indexed - go to the next file in both lists.  If the existing string is greater than the index string, then we know the file no longer exists (remember, lists are sorted).  In this case only advance the index list to see if it "catches up" with the existing list.  If the existing string is less than the index string, the reverse is done.  If we run out of existing list before we finish the index list, those files no longer exist - which is what the first if statement handles (the first else if handles the other case for running out of index list first).

If you have to work through this for loop on scratch paper, don't feel bad - I did.  It's been a very long time since I've had to get this "raw" with my code.  It was worth it as the performance went from several minutes at 100% CPU to so fast I couldn't believe it had run.

I googled a bit to see if there were other solutions out there, and more importantly some method in the .Net framework that would solve this issue and came up empty.  I'm not convinced there isn't something in the bazillion methods living in the framework, so if you know of something, please let me know!


Posted 10-03-2008 5:24 PM by Michael C. Neel
Filed under: , ,

[Advertisement]

Comments

LukeB wrote re: Finding the difference between two Arrays, or un-LINQ-ing your code
on 10-03-2008 6:42 PM

Iesi.Collections.ISet<> is your friend.

Carlos Beppler wrote re: Finding the difference between two Arrays, or un-LINQ-ing your code
on 10-03-2008 7:36 PM

Try to use System.Collections.Generic.HashSet<T> instead of lists.

To get an HashSet you can change the fist code to:

HashSet<String> toAdd = new HashSet<string>(

   from f in existingFiles

   where !indexedFiles.Contains(f)

   select f).AsEnumerable() );

Kevin H wrote re: Finding the difference between two Arrays, or un-LINQ-ing your code
on 10-03-2008 7:55 PM

There's always the .Except() method, which does what you want, and seems to be a little faster even than your mergesort (i'm guessing because it *is* a mergesort written unmanaged)

Adding/removing 1000 out of

Linq w/ contains: 17734.375ms

Except: 15.625ms

MergeSort: 46.875ms

source code follows:

   class Program

   {

       const int NumberOfSamples = 50000;

       static void Main(string[] args)

       {

           Random r = new Random();

           List<int> oldNumbers = new List<int>();

           List<int> newNumbers = new List<int>();

           for (int i = 0; i < NumberOfSamples; i++)

               oldNumbers.Add(r.Next());

           newNumbers.AddRange(oldNumbers);

           for (int i = 0; i < 1000; i++)

               newNumbers.RemoveAt(r.Next(newNumbers.Count));

           for (int i = 0; i < 1000; i++)

               newNumbers.Add(r.Next());

           DateTime start;

           GC.Collect();

           start = DateTime.Now;

           ContainsMethod(oldNumbers, newNumbers);            

           Console.WriteLine("{0}ms", (DateTime.Now - start).TotalMilliseconds);

           GC.Collect();

           start = DateTime.Now;

           ExceptMethod(oldNumbers, newNumbers);

           Console.WriteLine("{0}ms", (DateTime.Now - start).TotalMilliseconds);

           GC.Collect();

           start = DateTime.Now;

           MergeSortMethod(oldNumbers, newNumbers);

           Console.WriteLine("{0}ms", (DateTime.Now - start).TotalMilliseconds);

           Console.ReadKey();

       }

       static void ExceptMethod(List<int> oldNumbers, List<int> newNumbers)

       {

           List<int> removed = oldNumbers.Except(newNumbers).ToList();

           List<int> added = newNumbers.Except(oldNumbers).ToList();

           Console.WriteLine("Removed: {0}, Added: {1}", removed.Count, added.Count);

       }

       static void ContainsMethod(List<int> oldNumbers, List<int> newNumbers)

       {

           List<int> removed = (from f in oldNumbers

                                where !newNumbers.Contains(f)

                                select f).ToList();

           List<int> added = (from f in newNumbers

                              where !oldNumbers.Contains(f)

                              select f).ToList();

           Console.WriteLine("Removed: {0}, Added: {1}", removed.Count, added.Count);

       }

       static void MergeSortMethod(List<int> oldNumbers, List<int> newNumbers)

       {

           List<int> removed = new List<int>();

           List<int> added = new List<int>();

           newNumbers.Sort();

           oldNumbers.Sort();

           for (int iE = 0, iI = 0; iE < oldNumbers.Count || iI < newNumbers.Count; ) {

               if (iE >= oldNumbers.Count) {

                   removed.Add(newNumbers[iI]);

                   iI++;

               }

               else if (iI >= newNumbers.Count) {

                   added.Add(oldNumbers[iE]);

                   iE++;

               }

               else {

                   Int32 diff = oldNumbers[iE].CompareTo(newNumbers[iI]);

                   if (diff == 0) {

                       iE++;

                       iI++;

                   }

                   else if (diff > 0) {

                       removed.Add(newNumbers[iI]);

                       iI++;

                   }

                   else if (diff < 0) {

                       added.Add(oldNumbers[iE]);

                       iE++;

                   }

               }

           }

           Console.WriteLine("Removed: {0}, Added: {1}", removed.Count, added.Count);

       }

   }

smaclell wrote re: Finding the difference between two Arrays, or un-LINQ-ing your code
on 10-03-2008 7:59 PM

Since what you are doing appears to be filtering each list based on where they overlap you might want to check out the Intersects and Except extension methods.

rsanidad.wordpress.com/.../linq-except-and-intersect

Good luck.

charles wrote re: Finding the difference between two Arrays, or un-LINQ-ing your code
on 10-04-2008 5:54 AM

I have to try this one.

Dew Drop - October 4, 2008 | Alvin Ashcraft's Morning Dew wrote Dew Drop - October 4, 2008 | Alvin Ashcraft's Morning Dew
on 10-04-2008 10:40 AM

Pingback from  Dew Drop - October 4, 2008 | Alvin Ashcraft's Morning Dew

Bill wrote re: Finding the difference between two Arrays, or un-LINQ-ing your code
on 10-04-2008 5:20 PM

The Except extension method is what you are looking for (though your solution is almost twice as fast because using the extension method you would be sorting the lists twice, contrary to the incorrect results suggested by Keven). My results on the comparison methods above are:

ExceptMethod: 31.25ms

MergeSortMethod: 15.625ms

though I changed the instantiation for loops:

       static void Main(string[] args) {

           Random r = new Random(1);

           List<int> oldNumbers = Enumerable.Range(1, NumberOfSamples).Select(x => r.Next(NumberOfSamples*2)).Distinct().ToList();

           List<int> newNumbers = Enumerable.Range(1, NumberOfSamples).Select(x => r.Next(NumberOfSamples * 2)).Distinct().ToList();

A slight improvement on your code would be (a couple of saved instructions per loop, we are talking microseconds of improvement here):

var removed = new List<int>();

var added = new List<int>();

newNumbers.Sort();

oldNumbers.Sort();

int iE = 0, iI = 0, oldCount = oldNumbers.Count, newCount = newNumbers.Count;

while (iE < oldCount && iI < newCount) {

   var diff = oldNumbers[iE].CompareTo(newNumbers[iI]);

   if (diff == 0) {

       iE++;

       iI++;

   } else if (diff > 0) {

       removed.Add(newNumbers[iI]);

       iI++;

   } else {

       added.Add(oldNumbers[iE]);

       iE++;

   }

}

if (iE >= oldCount) {

   while (iI < newCount) {

       removed.Add(newNumbers[iI]);

       iI++;

   }

} else if (iI >= newCount) {

   while (iE < oldCount) {

       added.Add(oldNumbers[iE]);

       iE++;

   }

}

Kevin H wrote re: Finding the difference between two Arrays, or un-LINQ-ing your code
on 10-07-2008 8:51 PM

Well, the DateTime timing isn't very good (15ms resolution, plus the GC might run, etc). I've re-timed it a few times, and it seems that your method is slightly better than the method in the post, is slightly better than Except. However, they are all orders of magnitude faster than the LINQ .contains version, and 500k entries take less than a couple hundred ms, which is significantly less time than it would take to actually load that amount of data from the file system or db.

Fnord wrote re: Finding the difference between two Arrays, or un-LINQ-ing your code
on 06-05-2009 4:50 PM

I want to point out that comparing anything to the Contains() version is not very useful. That is a very bad way to do the comparison, and there are much better ways in LINQ, so the crap performance of the Contains() code does not indicate any problem with LINQ.

A similar tack would be comparing a brute-force bubble sort in C++ to a tuned quicksort in C#. The slowness of the bubble sort would not indicate that C# code runs faster than C++.

About The CodeBetter.Com Blog Network
CodeBetter.Com FAQ

Our Mission

Advertisers should contact Brendan

Subscribe
Google Reader or Homepage

del.icio.us CodeBetter.com Latest Items
Add to My Yahoo!
Subscribe with Bloglines
Subscribe in NewsGator Online
Subscribe with myFeedster
Add to My AOL
Furl CodeBetter.com Latest Items
Subscribe in Rojo

Member Projects
DimeCasts.Net - Derik Whittaker

Friends of Devlicio.us
Red-Gate Tools For SQL and .NET

NDepend

SlickEdit
 
SmartInspect .NET Logging
NGEDIT: ViEmu and Codekana
LiteAccounting.Com
DevExpress
Fixx
NHibernate Profiler
Unfuddle
Balsamiq Mockups
Scrumy
JetBrains - ReSharper
Umbraco
NServiceBus
RavenDb
Web Sequence Diagrams
Ducksboard<-- NEW Friend!

 



Site Copyright © 2007 CodeBetter.Com
Content Copyright Individual Bloggers

 

Community Server (Commercial Edition)