C# Distinct in LINQ query

I ran into a problem after changing some code. The idea is this: I count the words across several documents, but each word should be counted at most once per document. For example:

Document 1 = Smith Smith Smith Smith => Smith x1

Document 2 = Smith Alan Alan => Smith x1, Alan x1

Document 3 = John John => John x1

The total counts across all documents should then be:

Smith x2 (in 2 documents out of 3), Alan x1 (1 out of 3 documents), John x1 (1 out of 3 documents)

I think this worked before, when I had a separate method for the distinct case (it also counted every occurrence of each word if distinct = false). Now every word gets a count of just 1.

The code before:

    private Dictionary<string, int> tempDict = new Dictionary<string, int>();
    private void Splitter(string[] file)
    {              
            tempDict = file
                .SelectMany(i => File.ReadAllLines(i)
                .SelectMany(line => line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))                    
                .AsParallel()
                .Select(word => word.ToLower()) 
                .Distinct())
                .GroupBy(word => word)                    
                .ToDictionary(g => g.Key, g => g.Count());
    }

It had to be changed so that it returns the dictionary, and in the process of building the app I changed it to this code:

private Dictionary<string, int> Splitter(string[] file, bool distinct, bool pairs)
{
    var query = file
        .SelectMany(i => File.ReadLines(i)
        .SelectMany(line => line.Split(new[] { ' '}, StringSplitOptions.RemoveEmptyEntries))
        .AsParallel()
        .Select(word => word.ToLower())
        .Where(word => !word.All(char.IsDigit)));
    if (distinct)
    {
        query = query.Distinct();
    }
    if (pairs)
    {
        var pairWise = query.Pairwise((first, second) => string.Format("{0} {1}", first, second));

        return query
                .Concat(pairWise)
                .GroupBy(word => word)
                .ToDictionary(g => g.Key, g => g.Count());
    }
    return query
        .GroupBy(word => word)
        .ToDictionary(g => g.Key, g => g.Count());           
}

Also note that query = file.Distinct(); just returns the names of the documents, so the fix has to be something different.

Edit: this is how I call the method:

  private void EnterDocument(object sender, RoutedEventArgs e)
    {
        List<string> myFile= new List<string>();
        OpenFileDialog openFileDialog = new OpenFileDialog();
        openFileDialog.Multiselect = true;
        openFileDialog.Filter = "All files (*.*)|*.*|Text files (*.txt)|*.txt";
        if (openFileDialog.ShowDialog() == true)
        {
            foreach (string filename in openFileDialog.FileNames)
            {
                myFile.Add(filename);

            }
        }
        string[] myFiles= myFile.ToArray();
        myDatabase = Splitter(myFiles, true, false);
    }

Distinct() removes duplicates from your IEnumerable. In the new code it runs on the merged words of all the files rather than on each file separately, so calling it before the following...

return query
    .GroupBy(word => word)
    .ToDictionary(g => g.Key, g => g.Count());  

...will result in a list of all the unique words, each with a count of 1.
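The difference can be reproduced without any files. The following is only a small sketch: the `DistinctDemo` class and its in-memory `Documents` array are hypothetical stand-ins for the three documents in the question, already split and lower-cased. It compares Distinct() applied after the documents are merged with Distinct() applied inside the SelectMany lambda, as in the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DistinctDemo
{
    // Hypothetical in-memory stand-ins for the three documents in the question.
    public static readonly string[][] Documents =
    {
        new[] { "smith", "smith", "smith", "smith" },
        new[] { "smith", "alan", "alan" },
        new[] { "john", "john" }
    };

    // Distinct() after flattening: each word survives once overall,
    // so every group has exactly one element.
    public static Dictionary<string, int> GlobalDistinct() =>
        Documents
            .SelectMany(doc => doc)
            .Distinct()
            .GroupBy(word => word)
            .ToDictionary(g => g.Key, g => g.Count());

    // Distinct() inside the SelectMany lambda: each word survives once
    // per document, so the count is the number of documents containing it.
    public static Dictionary<string, int> PerDocumentDistinct() =>
        Documents
            .SelectMany(doc => doc.Distinct())
            .GroupBy(word => word)
            .ToDictionary(g => g.Key, g => g.Count());

    public static void Main()
    {
        Console.WriteLine(GlobalDistinct()["smith"]);      // prints 1
        Console.WriteLine(PerDocumentDistinct()["smith"]); // prints 2
    }
}
```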

Edit:

To avoid merging the lines of all the files before Distinct() runs, you could process each file separately, something like this:

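// Collects the unique words of every file; a word appears once per file that contains it.
List<string> allFilesWords = new List<string>();
foreach (var filename in file)
{
    var fileQuery = File.ReadLines(filename)
        .SelectMany(line => line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        .AsParallel()
        .Select(word => word.ToLower())
        .Where(word => !word.All(char.IsDigit));
    // Remove duplicates within this file only, then add the result to the combined list.
    allFilesWords.AddRange(fileQuery.Distinct());
}
// Each group's size is now the number of files that contain the word.
return allFilesWords
        .GroupBy(word => word)
        .ToDictionary(g => g.Key, g => g.Count());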
