Most common words in string

I am new to Ruby and trying to write a method that will return an array of the most common word(s) in a string. If there is one word with a high count, that word should be returned. If there are two words tied for the high count, both should be returned in an array.

The problem is that when I pass through the 2nd string, the code only counts "words" twice instead of three times. When the 3rd string is passed through, it returns "it" with a count of 2, which makes no sense, as "it" should have a count of 1.

def most_common(string)
  counts = {}
  words = string.downcase.tr(",.?!",'').split(' ')

  words.uniq.each do |word|
    counts[word] = 0
  end

  words.each do |word|
    counts[word] = string.scan(word).count
  end

  max_quantity = counts.values.max
  max_words = counts.select { |k, v| v == max_quantity }.keys
  puts max_words
end

most_common('a short list of words with some words') #['words']
most_common('Words in a short, short words, lists of words!') #['words']
most_common('a short list of words with some short words in it') #['words', 'short']

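The problem is the way you count occurrences. string.scan(word) matches substrings rather than whole words, so "it" is also found inside "with" and gets counted twice. It also searches the original string, which has not been downcased, so the capitalized "Words" in the 2nd string is missed and "words" is counted only twice.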

[1] pry(main)> 'with some words in it'.scan('it')
=> ["it", "it"]
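To make the difference concrete, here is a minimal sketch comparing a plain substring scan with a whole-word scan. The \b word-boundary pattern is my own illustration and is not part of the original code.

```ruby
sentence = 'with some words in it'

# A string argument matches substrings, so "it" is also found inside "with".
substring_hits = sentence.scan('it')

# A regexp with \b word boundaries only matches "it" as a whole word.
whole_word_hits = sentence.scan(/\bit\b/)

p substring_hits   #=> ["it", "it"]
p whole_word_hits  #=> ["it"]
```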

There is an easier way, though: you can count how many times each value occurs in an array with an each_with_object call, like so:

counts = words.each_with_object(Hash.new(0)) { |e, h| h[e] += 1 }

This goes through each entry in the array and adds 1 to the value for each word's entry in the hash.
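As a quick check, the sketch below runs that line on a small array. On Ruby 2.7 or later, Array#tally should give the same hash; that is an alternative I am adding here, not something the code in this answer relies on.

```ruby
words = %w[a short list of words with some words]

# Hash.new(0) makes missing keys default to 0, so += 1 works the first time a word is seen.
counts = words.each_with_object(Hash.new(0)) { |e, h| h[e] += 1 }

p counts["words"]  #=> 2
p counts["short"]  #=> 1

# Assumption: Ruby 2.7+, where Array#tally is available.
same_as_tally = words.respond_to?(:tally) ? (words.tally == counts) : true
p same_as_tally    #=> true
```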

So the following should work for you:

def most_common(string)
  words = string.downcase.tr(",.?!",'').split(' ')
  counts = words.each_with_object(Hash.new(0)) { |e, h| h[e] += 1 }
  max_quantity = counts.values.max
  counts.select { |k, v| v == max_quantity }.keys
end

p most_common('a short list of words with some words') #=> ["words"]
p most_common('Words in a short, short words, lists of words!') #=> ["words"]
p most_common('a short list of words with some short words in it') #=> ["short", "words"]

As Nick has answered your question, I will just suggest another way this can be done. As "high count" is vague, I suggest you return a hash with downcased words and their respective counts. Since Ruby 1.9, hashes retain the order that key-value pairs have been entered, so we may want to make use of that and return the hash with key-value pairs ordered in decreasing order of values.
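A small sketch of the insertion-order behaviour mentioned above, using a made-up hash of counts. It assumes Ruby 2.1 or later for Array#to_h.

```ruby
counts = { "of" => 1, "words" => 3, "short" => 2 }

# Keys come back in the order they were inserted.
p counts.keys     #=> ["of", "words", "short"]

# Rebuilding the hash from pairs sorted by descending count
# gives a hash whose iteration order follows that sort.
ordered = counts.sort_by { |_word, count| -count }.to_h
p ordered.keys    #=> ["words", "short", "of"]
p ordered.values  #=> [3, 2, 1]
```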

Code

def words_by_count(str)
  str.gsub(/./) do |c|
    case c
    when /\w/ then c.downcase
    when /\s/ then c
    else ''
    end
  end.split
     .group_by {|w| w}
     .map {|k,v| [k,v.size]}
     .sort_by(&:last)
     .reverse
     .to_h
end
words_by_count('Words in a short, short words, lists of words!')

The method Array#to_h was introduced in Ruby 2.1. For earlier Ruby versions, one must use:

Hash[str.gsub(/./)... .reverse]

Example

words_by_count('a short list of words with some words')
  #=> {"words"=>2, "of"=>1, "some"=>1, "with"=>1,
  #    "list"=>1, "short"=>1, "a"=>1}
words_by_count('Words in a short, short words, lists of words!')
  #=> {"words"=>3, "short"=>2, "lists"=>1, "a"=>1, "in"=>1, "of"=>1}
words_by_count('a short list of words with some short words in it')
  #=> {"words"=>2, "short"=>2, "it"=>1, "with"=>1,
  #    "some"=>1, "of"=>1, "list"=>1, "in"=>1, "a"=>1}

Explanation

Here is what's happening in the second example, where:

str = 'Words in a short, short words, lists of words!'

str.gsub(/./) do |c|... matches each character in the string and passes it to the block, which decides what to do with it. As you can see, word characters are downcased, whitespace is left alone, and everything else is removed (replaced with an empty string).

s = str.gsub(/./) do |c|
      case c
      when /\w/ then c.downcase
      when /\s/ then c
      else ''
      end
    end
  #=> "words in a short short words lists of words"

This is followed by

a = s.split
 #=> ["words", "in", "a", "short", "short", "words", "lists", "of", "words"]
h = a.group_by {|w| w}
 #=> {"words"=>["words", "words", "words"], "in"=>["in"], "a"=>["a"],
 #    "short"=>["short", "short"], "lists"=>["lists"], "of"=>["of"]}
b = h.map {|k,v| [k,v.size]}
 #=> [["words", 3], ["in", 1], ["a", 1], ["short", 2], ["lists", 1], ["of", 1]]
c = b.sort_by(&:last)
 #=> [["of", 1], ["in", 1], ["a", 1], ["lists", 1], ["short", 2], ["words", 3]]
d = c.reverse
 #=> [["words", 3], ["short", 2], ["lists", 1], ["a", 1], ["in", 1], ["of", 1]]
d.to_h # or Hash[d]
 #=> {"words"=>3, "short"=>2, "lists"=>1, "a"=>1, "in"=>1, "of"=>1}

Note that c = b.sort_by(&:last) followed by d = c.reverse can be replaced by:

d = b.sort_by { |_,k| -k }
 #=> [["words", 3], ["short", 2], ["a", 1], ["in", 1], ["lists", 1], ["of", 1]]

but sort followed by reverse is generally faster.
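If you want to check that claim on your own machine, here is a rough sketch using Benchmark from the standard library. The sample data and the repetition count are arbitrary choices of mine, and timings will vary by Ruby version and hardware, so treat the numbers as indicative only.

```ruby
require 'benchmark'

# Arbitrary sample data: [word, count] pairs shaped like the array b above.
pairs = Array.new(10_000) { |i| ["word#{i}", i % 100] }

sort_then_reverse = nil
sort_negated      = nil

t1 = Benchmark.realtime { 50.times { sort_then_reverse = pairs.sort_by(&:last).reverse } }
t2 = Benchmark.realtime { 50.times { sort_negated = pairs.sort_by { |_, k| -k } } }

puts format("sort_by(&:last).reverse: %.4fs", t1)
puts format("sort_by { |_, k| -k }:   %.4fs", t2)

# Both orderings agree on the sequence of counts; the order of ties may differ.
same_counts = sort_then_reverse.map(&:last) == sort_negated.map(&:last)
p same_counts  #=> true
```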

def count_words string
  word_list = Hash.new(0)
  words     = string.downcase.delete(',.?!').split
  words.each { |word| word_list[word] += 1 }
  word_list
end

def most_common_words string
  hash      = count_words string
  max_value = hash.values.max
  hash.select { |k, v| v == max_value }.keys
end

most_common_words 'a short list of words with some words'
#=> ["words"]

most_common_words 'Words in a short, short words, lists of words!'
#=> ["words"]

most_common_words 'a short list of words with some short words in it'
#=> ["short", "words"]

Assuming string is a string containing multiple words:

words = string.split(/[.!?,\s]+/)
words.sort_by { |x| words.count(x) }

Here we split the string into an array of words, using runs of punctuation and whitespace as separators. We then sort the array by how many times each word occurs in it. The most common words will appear at the end.
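Note that the sorted array still contains every occurrence of each word, and words.count is re-run for every element. A possible refinement, sketched below under the assumption that case should be ignored (the snippet above does not downcase), is to count once and then sort the unique words:

```ruby
string = 'Words in a short, short words, lists of words!'

# Split on runs of punctuation/whitespace; downcasing is my assumption here.
words = string.downcase.split(/[.!?,\s]+/)

# Count each word once instead of calling words.count for every element.
counts = words.each_with_object(Hash.new(0)) { |w, h| h[w] += 1 }

# Unique words, least common first, most common last.
sorted = counts.keys.sort_by { |w| counts[w] }

p sorted.last      #=> "words"
p counts["words"]  #=> 3
```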

The same thing can be done in the following way too:

def most_common(string)
  counts = Hash.new 0
  string.downcase.tr(",.?!",'').split(' ').each{|word| counts[word] += 1}
  # For "Words in a short, short words, lists of words!"
  # counts ---> {"words"=>3, "in"=>1, "a"=>1, "short"=>2, "lists"=>1, "of"=>1}
  max_value = counts.values.max
  # max_value ---> 3
  counts.select { |key, value| value == max_value }
  # returns ---> {"words"=>3} (a hash of word => count, not an array of words)
end

This is just a shorter solution, which you might want to use. Hope it helps :)

This is the kind of question programmers love, isn't it :) How about a functional approach?

# returns array of words after removing certain English punctuation marks
def english_words(str)
  str.downcase.delete(',.?!').split
end

# returns hash mapping element to count
def element_counts(ary)
  ary.group_by { |e| e }.inject({}) { |a, e| a.merge(e[0] => e[1].size) }
end

def most_common(ary)
  ary.empty? ? nil : 
    element_counts(ary)
      .group_by { |k, v| v }
      .sort
      .last[1]
      .map(&:first)
end

most_common(english_words('a short list of words with some short words in it'))
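#=> ["short", "words"]

# Returns the first word that has the highest count (ties are not reported).
def firstRepeatedWord(string)
  h_data = Hash.new(0)
  string.split(" ").each { |x| h_data[x] += 1 }
  h_data.key(h_data.values.max)
end

# Prints the [word, count] pair with the highest count.
def common(string)
  counts = Hash.new(0)
  words = string.downcase.delete('.,!?').split(" ")
  words.each { |k| counts[k] += 1 }
  p counts.max_by { |_word, count| count }
end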
