
ArrayList vs List vs Dictionary

Can anyone help me? I have the following code:

private List<string> GenerateTerms(string[] docs)
{
    List<string> uniques = new List<string>();

    for (int i = 0; i < docs.Length; i++)
    {
        string[] tokens = docs[i].Split(' ');

        List<string> toktolist = new List<string>(tokens.ToList());

        var query = toktolist.GroupBy(word => word)
             .OrderByDescending(g => g.Count())
             .Select(g => g.Key)
             .Take(20000);              

        foreach (string k in query)
        {
            if (!uniques.Contains(k)) 
                uniques.Add(k);
        }
    }            

    return uniques;            
}

The goal is to generate terms from a number of documents, ranked by highest frequency. I did the same procedure using a Dictionary, and in both cases it took about 440 milliseconds. But the surprise came when I used an ArrayList, as in the following code:

private ArrayList GenerateTerms(string[] docs)
{
    Dictionary<string, int> yy = new Dictionary<string, int>();
    ArrayList uniques = new ArrayList();

    for (int i = 0; i < docs.Length; i++)
    {
        string[] tokens = docs[i].Split(' ');
        yy.Clear();
        for (int j = 0; j < tokens.Length; j++)
        {
            if (!yy.ContainsKey(tokens[j].ToString()))
                yy.Add(tokens[j].ToString(), 1);
            else
                yy[tokens[j].ToString()]++;
        }

        var sortedDict = (from entry in yy
                          orderby entry.Value descending
                          select entry).Take(20000)
                         .ToDictionary(pair => pair.Key, pair => pair.Value);

        foreach (string k in sortedDict.Keys)
        {
            if (!uniques.Contains(k))
                uniques.Add(k);
        }
    }

    return uniques;
}

That version took only 350 milliseconds. Shouldn't ArrayList be slower than List and Dictionary? Please help me make sense of this.
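For reference, the Dictionary version mentioned above was only described, not posted. It presumably swapped the uniques container for a Dictionary so that membership is checked with ContainsKey; the following is a reconstruction under that assumption (the method name and details are guesses, not the asker's code):

// Hypothetical reconstruction of the unposted Dictionary variant:
// the uniques container becomes a Dictionary, so membership is
// checked with ContainsKey instead of List.Contains.
private List<string> GenerateTermsWithDictionary(string[] docs)
{
    var uniques = new Dictionary<string, bool>();

    for (int i = 0; i < docs.Length; i++)
    {
        var query = docs[i].Split(' ')
                           .GroupBy(word => word)
                           .OrderByDescending(g => g.Count())
                           .Select(g => g.Key)
                           .Take(20000);

        foreach (string k in query)
        {
            if (!uniques.ContainsKey(k))
                uniques.Add(k, true);
        }
    }

    return uniques.Keys.ToList();
}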

Your code does a lot of unnecessary work and uses inefficient data structures.

Try this instead:

private List<string> GenerateTerms(string[] docs)
{
    var result = docs
        .SelectMany(doc => doc.Split(' ')
                              .GroupBy(word => word)
                              .OrderByDescending(g => g.Count())
                              .Select(g => g.Key)
                              .Take(20000))
        .Distinct()
        .ToList();
    return result;
}

Refactored version to make it easier to read:

private List<string> GenerateTerms(string[] docs)
{
    return docs.SelectMany(doc => ProcessDocument(doc)).Distinct().ToList();
}

private IEnumerable<string> ProcessDocument(string doc)
{
    return doc.Split(' ')
              .GroupBy(word => word)
              .OrderByDescending(g => g.Count())
              .Select(g => g.Key)
              .Take(20000);
}
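As a quick sanity check, here is how that refactored version behaves on two tiny documents (a minimal usage sketch with made-up inputs, not from the answer):

// Minimal usage sketch (hypothetical inputs).
string[] docs =
{
    "a man a plan",   // "a" occurs twice, so it sorts first within this doc
    "a canal panama"
};

List<string> terms = GenerateTerms(docs);

// Within each document, terms come back ordered by descending frequency;
// Distinct() (LINQ to Objects) then yields each term in first-seen order.
Console.WriteLine(string.Join(", ", terms));
// Prints: a, man, plan, canal, panama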

I like Mark's solution. However, I thought that if you leverage a Dictionary properly you could squeeze out some more performance. Sure enough, this is a lot faster...

private static List<string> GenerateTerms(string[] docs)
{
    var termsDictionary = new Dictionary<string, int>();

    foreach (var doc in docs)
    {
        var terms = doc.Split(' ');
        int uniqueTermsCount = 0;

        foreach (string term in terms)
        {
            if (termsDictionary.ContainsKey(term))
                termsDictionary[term]++;
            else
            {
                uniqueTermsCount++;
                termsDictionary[term] = 1;
            }
        }

        if (uniqueTermsCount >= 20000)
            break;
    }

    return (from entry in termsDictionary
            orderby entry.Value descending
            select entry.Key).ToList();
}

To explain briefly: termsDictionary holds each term and the number of times it occurs. The LINQ query at the end then returns the terms in descending order of occurrence count.
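As a small worked example (mine, using the test input from the benchmark below), here is how the counting loop fills the dictionary for one document:

// Illustration of the counting pattern above (not part of the original answer).
var counts = new Dictionary<string, int>();
foreach (string term in "a man a plan a canal panama".Split(' '))
{
    if (counts.ContainsKey(term))
        counts[term]++;          // seen before: bump the count
    else
        counts[term] = 1;        // first occurrence
}
// counts now holds: a=3, man=1, plan=1, canal=1, panama=1
// Ordering by value descending yields: a, man, plan, canal, panama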

UPDATE

I added code to limit the unique number of terms to 20,000 per doc.

Here are the benchmarking results...

  • 322 ms (Original)
  • 284 ms (Mark Byers solution)
  • 113 ms (Leveraging the Dictionary as above)

Below is the code I used to generate the docs array and run the tests...

static void Main(string[] args)
{
    string[] docs = new string[50000];

    for (int i = 0; i < docs.Length; i++)
    {
        docs[i] = "a man a plan a canal panama";
    }

    // warm up (don't time this)
    GenerateTermsOriginal(docs);

    Stopwatch sw = new Stopwatch();
    sw.Restart();
    var t1 = GenerateTermsOriginal(docs);
    sw.Stop();
    Console.WriteLine(sw.ElapsedMilliseconds + " ms");

    sw.Restart();
    var t2 = GenerateTermsLinq(docs);
    sw.Stop();
    Console.WriteLine(sw.ElapsedMilliseconds + " ms");

    sw.Restart();
    var t3 = GenerateTermsDictionary(docs);
    sw.Stop();
    Console.WriteLine(sw.ElapsedMilliseconds + " ms");
}
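A closing note on the original question: ArrayList itself is probably not why the second version won. For a reference type like string, ArrayList.Contains and List<string>.Contains are both O(n) linear scans with similar cost (ArrayList adds a cast on retrieval, not boxing). The speedup more likely comes from counting frequencies with a plain Dictionary loop instead of the GroupBy/OrderByDescending pipeline and the extra list copies. In either version, the membership test can be made O(1) with a HashSet<string>; below is a minimal sketch of that change (my addition, not from the thread; it keeps only the de-duplication step and skips the per-document frequency ordering for brevity):

// Sketch: HashSet<string> gives O(1) membership checks, so de-duplication
// no longer scans the whole uniques list for every term.
// (GenerateTermsHashSet is a made-up name; this is not from the thread.)
private static List<string> GenerateTermsHashSet(string[] docs)
{
    var seen = new HashSet<string>();  // O(1) Add/Contains
    var uniques = new List<string>();  // preserves first-seen order

    foreach (string doc in docs)
    {
        foreach (string token in doc.Split(' '))
        {
            if (seen.Add(token))       // Add returns false if already present
                uniques.Add(token);
        }
    }

    return uniques;
}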
