How would you develop a frequency-sorted list of the ten thousand most-used words in the English language?

Question

I was once asked by a current employee how I would develop a frequency-sorted list of the ten thousand most-used words in the English language. Suggest a solution in the language of your choosing, though I prefer C#.

Please provide not only an implementation, but also an explanation.

Thanks

Answer 1

IEnumerable<string> inputList; // input words.
var mostFrequentlyUsed = inputList.GroupBy(word => word)
  .Select(wordGroup => new { Word = wordGroup.Key, Frequency = wordGroup.Count() })
  .OrderByDescending(word => word.Frequency);

Explanation: I don't really know if it requires further explanation but I'll try. inputList is an array or any other collection providing source words. GroupBy function will group the input collection by some similar property (which is, in my code the object itself, as noted by the lambda word => word ). The output (which is a set of groups by a specified key, the word) will be transformed to an object with Word and Frequency properties and sorted by Frequency property in descending order. You could use .Take(10000) to take the first 10000. The whole thing can be easily parallelized by .AsParallel() provided by PLINQ. The query operator syntax might look clearer:

var mostFrequentlyUsed = 
     (from word in inputList
      group word by word into wordGroup
      select new { Word = wordGroup.Key, Frequency = wordGroup.Count() })
     .OrderByDescending(word => word.Frequency).Take(10000);

Answer 2

As a first cut, absent further definition of the problem (just what do you mean by the most-used words in English?) -- I'd buy Google's n-gram data , intersect the 1-grams with an English dictionary, and pipe that to sort -rn -k 2 | head -10000 sort -rn -k 2 | head -10000 .

Answer 3

I would use map-reduce. This is a canonical example of a task well-suited for it. You can use Hadoop with C# with the streaming protocol . There are also other approaches. See Is there a .NET equivalent to Apache Hadoop? and https://stackoverflow.com/questions/436686/-net-mapreduce-implementation .

Answer 4

First thing to pop into my head (not syntax checked, and verbose (for perl) for demonstrative purposes)

#!/usr/bin/perl

my %wordFreq
foreach ( my $word in @words)
{
   $wordFreq{$word}++;
}

my @mostPopularWords = sort{$wordFreq{$a} <=> $wordFreq{$b} } keys %wordFreq;
for (my $i=0; $i < 10000; ++$i)
{
   print "$i: $mostPopularWords[$i] ($wordFreq{$mostPopularWords[$i]} hits)\n"
}

How would you develop a frequency-sorted list of the ten thousand most-used words in the English language?

Question

4 answers

solution1
3 ACCPTED 2009-05-14 19:42:23

solution2
2 2009-05-15 02:50:38

solution3
1 2009-05-14 19:37:20

solution4
1 2009-05-14 19:47:22

How would you develop a frequency-sorted list of the ten thousand most-used words in the English language?

Question

4 answers

solution1 3 ACCPTED 2009-05-14 19:42:23

solution2 2 2009-05-15 02:50:38

solution3 1 2009-05-14 19:37:20

solution4 1 2009-05-14 19:47:22

solution1
3 ACCPTED 2009-05-14 19:42:23

solution2
2 2009-05-15 02:50:38

solution3
1 2009-05-14 19:37:20

solution4
1 2009-05-14 19:47:22