How to efficiently compare two lists with 500k objects and strings

I have a main directory with subfolders and around 500k images. I know a lot of these images do not exist in my database, and I want to find out which ones so that I can delete them.

This is the code I have so far:

var listOfAdPictureNames = ImageDB.GetAllAdPictureNames();

var listWithFilesFromImageFolder = ImageDirSearch(adPicturesPath);

var result = listWithFilesFromImageFolder.Where(p => !listOfAdPictureNames.Any(q => p.FileName == q));

var differenceList = result.ToList();

listOfAdPictureNames is of type List<string>

Here is the model that I'm returning from ImageDirSearch:

public class CheckNotUsedAdImagesModel
{
    public List<ImageDirModel> ListWithUnusedAdImages { get; set; }
}

public class ImageDirModel
{
    public string FileName { get; set; }
    public string Path { get; set; }
}

And here is the recursive method that gets all images from my folder:

private List<ImageDirModel> ImageDirSearch(string path)
{
    string adPicturesPath = ConfigurationManager.AppSettings["AdPicturesPath"];
    List<ImageDirModel> files = new List<ImageDirModel>();

    try
    {
        foreach (string f in Directory.GetFiles(path))
        {
            var model = new ImageDirModel();
            model.Path = f.ToLower();
            model.FileName = Path.GetFileName(f.ToLower());
            files.Add(model);
        }
        foreach (string d in Directory.GetDirectories(path))
        {
            files.AddRange(ImageDirSearch(d));
        }
    }
    catch (System.Exception excpt)
    {
        throw new Exception(excpt.Message);
    }

    return files;
}

The problem I have is that this line:

var result = listWithFilesFromImageFolder.Where(p => !listOfAdPictureNames.Any(q => p.FileName == q));

takes over an hour to complete. I want to know if there is a better way to check whether my images folder contains images that don't exist in my database.

Here is the method that gets all the image names from my database layer:

    public static List<string> GetAllAdPictureNames()
    {
        List<string> ListWithAllAdFileNames = new List<string>();

        using (var db = new DatabaseLayer.DBEntities())
        {
            ListWithAllAdFileNames = db.ad_pictures.Select(b => b.filename.ToLower()).ToList();
        }

        if (ListWithAllAdFileNames.Count < 1)
            return new List<string>();

        return ListWithAllAdFileNames;
    }

Perhaps Except is what you're looking for. Something like this:

var filesInFolderNotInDb = listWithFilesFromImageFolder.Select(p => p.FileName).Except(listOfAdPictureNames).ToList();

This should give you the names of the files that exist in the folder but not in the database.
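The reason Except tends to be much faster here is that it builds a hash set from the second sequence once, so each lookup is roughly constant time instead of a scan of the whole list. A minimal sketch on plain strings, with made-up file names (the ExceptSketch class and NotInDb method are illustrative names, not part of the question's code):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ExceptSketch
{
    // Returns the names present in folderNames but missing from dbNames.
    // Except hashes dbNames once, so the work is roughly O(n + m)
    // rather than the O(n * m) of Where(p => !list.Any(...)).
    public static List<string> NotInDb(IEnumerable<string> folderNames, IEnumerable<string> dbNames)
    {
        return folderNames.Except(dbNames).ToList();
    }
}
```

For example, NotInDb(new[] { "a.jpg", "b.jpg", "c.jpg" }, new[] { "b.jpg" }) yields a.jpg and c.jpg. Keep in mind that Except returns distinct values, so two files with the same name in different subfolders would show up only once.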

As I said in my comment, you seem to have recreated the FileInfo class. You don't need to do this, so your ImageDirSearch can become the following:

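private IEnumerable<string> ImageDirSearch(string path)
{
    // AllDirectories also walks the subfolders, like the original recursive search.
    // EnumerateFiles yields full paths; use Path.GetFileName to get the bare name.
    return Directory.EnumerateFiles(path, "*.jpg", SearchOption.AllDirectories);
}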

There doesn't seem to be much gained by returning the whole file info when you only need the file name. Note that this only finds .jpg files, but that can be changed.
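If the folder holds more than one image type, one option is to enumerate everything and filter on the extension. This is only a sketch: the extension list and the ImageEnumerationSketch and EnumerateImages names are assumptions for illustration, not something from the question.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class ImageEnumerationSketch
{
    // Hypothetical extension list; adjust to whatever the image folder holds.
    private static readonly HashSet<string> ImageExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { ".jpg", ".jpeg", ".png", ".gif" };

    // Lazily yields the full path of every image under root, including subfolders.
    public static IEnumerable<string> EnumerateImages(string root)
    {
        return Directory
            .EnumerateFiles(root, "*", SearchOption.AllDirectories)
            .Where(f => ImageExtensions.Contains(Path.GetExtension(f)));
    }
}
```

Calling EnumerateImages(adPicturesPath).Select(p => Path.GetFileName(p)) would then give just the file names, while the full path stays available for the later delete.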

The ToLower calls are quite expensive and a bit pointless, and so is the ToList call when you are planning on querying the result again. You can get rid of both and return an IEnumerable instead (this is in the GetAllAdPictureNames method).

Then your comparison can use Equals and ignore case:

!listOfAdPictureNames.Any(q => p.Equals(q, StringComparison.InvariantCultureIgnoreCase));
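That lambda still scans the whole list for every file. One way to keep the case-insensitive comparison while avoiding the scan is to load the database names into a HashSet built with a case-insensitive comparer. A sketch, assuming ordinal case-insensitive matching is acceptable for file names (the line above uses InvariantCultureIgnoreCase) and using a hypothetical FindUnused helper:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class UnusedImageSketch
{
    // Returns the full paths of files whose name is not in dbNames, ignoring case.
    // No ToLower is needed: the comparer handles the casing during lookup.
    public static List<string> FindUnused(IEnumerable<string> filePaths, IEnumerable<string> dbNames)
    {
        var known = new HashSet<string>(dbNames, StringComparer.OrdinalIgnoreCase);
        return filePaths
            .Where(p => !known.Contains(Path.GetFileName(p)))
            .ToList();
    }
}
```

Unlike Except, this keeps every path, including files with the same name in different subfolders, so the result can be handed straight to the delete step.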

One more thing that will probably help is removing items from the list of file names as they are found. This should make searching the list quicker each time one is removed, since there is less to iterate through.

Instead of repeating a linear search over the list for every file, it is better to sort the second list, listOfAdPictureNames, once (using any O(n log n) sort) and then check for existence with a binary search. Each lookup then takes O(log n) time, whereas the current approach is quadratic: every one of the roughly 500k files is compared against every name from the database.
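A rough sketch of the sort-then-binary-search idea, using List<T>.Sort and List<T>.BinarySearch. The BinarySearchSketch class and NotInDb method are illustrative names, and the choice of an ordinal case-insensitive comparer is an assumption:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BinarySearchSketch
{
    // Returns the names in folderNames that are not found in dbNames.
    public static List<string> NotInDb(IEnumerable<string> folderNames, IEnumerable<string> dbNames)
    {
        var comparer = StringComparer.OrdinalIgnoreCase;

        // Sort a copy once: O(m log m).
        var sorted = new List<string>(dbNames);
        sorted.Sort(comparer);

        // BinarySearch returns a negative number when the item is absent.
        // The same comparer must be used for sorting and searching.
        return folderNames
            .Where(n => sorted.BinarySearch(n, comparer) < 0)
            .ToList();
    }
}
```

For example, NotInDb(new[] { "a.jpg", "B.JPG", "c.jpg" }, new[] { "c.jpg", "b.jpg" }) returns only a.jpg.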


 