LINQ is great, up to the point when it's not. Then it's really not great at all. When we trade simplicity of syntax for performance, we have to keep in mind there may come a point when we have to go back the code and factor out the shiny new toys.
My system is complex, but this task is simple. I have about 95,000 video files to keep track of in a database. Try as we might, at some point someone is going to make changes, add or remove files, without going through our software, so a periodic file audit is required.
I have two lists of file names, one is the list as it exists in the database, the other as it exists on disk. This is greatly simplified, but the initial idea was something like this:
List<String> toAdd = (from f in existingFiles
where !indexedFiles.Contains(f)
select f).ToList();
List<String> toRemove = (from f in indexedFiles
where !existingFiles.Contains(f)
select f).ToList();
The power of LINQ, short and sweet. Sadly, it appears to act like this under the hood:
List<String> toAdd = new List<String>();
foreach (String f in existingFiles)
if (!indexedFiles.Contains(f))
toAdd.Add(f);
List<String> toRemove = new List<String>();
foreach (String f in indexedFiles)
if (!existingFiles.Contains(f))
toRemove.Add(f);
These statements pegged the CPU for minutes at a time - can you see why? How many times am I looping through each collection of strings? Twice? Four times? No, I'm afraid this is classic "Oh 2 Da N" performance here. The problems lies in using the Contains() method - this method must search through the collection each time it's called, which is often. To solve this, we'll have to kick it old school C-style:
indexedFiles.Sort();
existingFiles.Sort();
for (int iE = 0, iI = 0; iE < existingFiles.Count || iI < indexedFiles.Count; ) {
if (iE >= existingFiles.Count) {
toRemove.Add(indexedFiles[iI]);
iI++;
}
else if (iI >= indexedFiles.Count) {
toAdd.Add(existingFiles[iE]);
iE++;
}
else {
Int32 diff = existingFiles[iE].CompareTo(indexedFiles[iI]);
if (diff == 0) {
iE++;
iI++;
}
else if (diff > 0) {
toRemove.Add(indexedFiles[iI]);
iI++;
}
else if (diff < 0) {
toAdd.Add(existingFiles[iE]);
iE++;
}
}
}
A quick sort of the lists and we can make some time saving assumptions as well as walk through both lists at the same time (.Net had no trouble sorting lists of 95,000 strings in record breaking time). We start off comparing the strings at index 0 of each list (the first if and else if will be false - I'll come back to those shortly). If they match, then the file both exists and is indexed - go to the next file in both lists. If the existing string is greater than the index string, then we know the file no longer exists (remember, lists are sorted). In this case only advance the index list to see if it "catches up" with the existing list. If the existing string is less than the index string, the reverse is done. If we run out of existing list before we finish the index list, those files no longer exist - which is what the first if statement handles (the first else if handles the other case for running out of index list first).
If you have to work through this for loop on scratch paper, don't feel bad - I did. It's been a very long time since I've had to get this "raw" with my code. It was worth it as the performance went from several minutes at 100% CPU to so fast I couldn't believe it had run.
I googled a bit to see if there were other solutions out there, and more importantly some method in the .Net framework that would solve this issue and came up empty. I'm not convinced there isn't something in the bazillion methods living in the framework, so if you know of something, please let me know!
Posted
10-03-2008 5:24 PM
by
Michael C. Neel