
Can I find frequently occurring phrases in an array of strings where the phrase only forms part of each string?

This has been asked before but wasn't ever answered.

I want to search through an array of strings and find the most frequent phrases (2 or more words) that occur within those strings, so given:

["hello, my name is Emily, I'm from London", 
"this chocolate from London is really good", 
"my name is James, what did you say yours was", 
"is he from London?"]

I would want to get back something along the lines of:

{"from London" => 3, "my name is" => 2 }

I don't really know how to approach this. Any suggestions would be awesome, even if it was just a strategy that I could test out.

This isn't a one-stage process, but it is possible. Ruby knows what a character, a digit, or a string is, but it doesn't know what a phrase is.

You'd need to:

  1. Begin by either building a list of phrases or finding one online. This list then forms the basis of the matching process.

  2. Iterate through the list of phrases for each string, to see whether an instance of any of the phrases from the list occurs within that string.

  3. Record a count of each instance of a phrase within a string.
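
The three steps above can be sketched roughly as follows. This is only a minimal illustration: the `known_phrases` list and the `count_known_phrases` helper are made up for the example, and it assumes a phrase should match as whole words, ignoring case.

```ruby
# Hypothetical phrase list; in practice it would be built by hand or loaded from a file.
known_phrases = ["from london", "my name is"]

strings = ["hello, my name is Emily, I'm from London",
           "this chocolate from London is really good",
           "my name is James, what did you say yours was",
           "is he from London?"]

# Steps 2 and 3: look for every phrase in every string and record how often it occurs.
def count_known_phrases(strings, phrases)
  phrases.each_with_object(Hash.new(0)) do |phrase, counts|
    # \b keeps "is he" from matching inside "this he"; the i flag ignores case.
    pattern = /\b#{Regexp.escape(phrase)}\b/i
    strings.each { |s| counts[phrase] += s.scan(pattern).size }
  end
end

counts = count_known_phrases(strings, known_phrases)
# "from london" is counted 3 times and "my name is" 2 times
```

The obvious limitation is that only phrases already in the list can ever be found.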

Although it might not seem like it, this is quite a high-level question, so try to break the task down into smaller tasks.

Here is something that might get you started. It is a brute-force approach and will be very slow for large data sets.

x = ["hello, my name is Emily, I'm from London", 
"this chocolate from London is really good", 
"my name is James, what did you say yours was", 
"is he from London?"]

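word_maps = x.flat_map do |line|
  words = line.downcase.scan(/\w+/)
  # every run of 2..n consecutive words in this line
  (2..words.size).flat_map { |n| words.each_cons(n).map { |run| run.join(' ') } }
end

# keep only the phrases that occur more than once, together with their counts
word_maps_hash = Hash[word_maps.group_by { |phrase| phrase }.reject { |_, hits| hits.size == 1 }.map { |phrase, hits| [phrase, hits.size] }]

# drop any phrase contained in a longer repeated phrase (e.g. "my name" inside "my name is")
original_hash_keys = word_maps_hash.keys
word_maps_hash.delete_if { |key, _| original_hash_keys.any? { |ohk| ohk[key] && ohk != key } }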

p word_maps_hash #=> {"from london"=>3, "my name is"=>2}
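
To see why this gets slow, it may help to look at what the `each_cons` step produces for a single string. The `phrases_in` helper below is a hypothetical name used only for this illustration; it repeats the body of the `flat_map` block above.

```ruby
# Hypothetical helper: every run of two or more consecutive words in one line.
def phrases_in(line)
  words = line.downcase.scan(/\w+/)
  (2..words.size).flat_map do |n|
    words.each_cons(n).map { |run| run.join(' ') }
  end
end

candidates = phrases_in("is he from London?")
# => ["is he", "he from", "from london", "is he from", "he from london", "is he from london"]
```

A string of w words yields w * (w - 1) / 2 candidate phrases (6 for the 4 words here), so the number of candidates grows quadratically with the length of each string.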

How about this, which counts two-word phrases only:

x = ["hello, my name is Emily, I'm from London", 
"this chocolate from London is really good", 
"my name is James, what did you say yours was", 
"is he from London?"]

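# Split each string into words (keeping apostrophes) and pair up neighbours
# within that string only, so that no pair spans two different strings.
word_pairs = x.flat_map do |phrase|
  phrase.split(/[^\w']+/).each_cons(2).map { |pair| pair.join(' ') }
end
counts = Hash.new(0)
word_pairs.each { |pair| counts[pair] += 1 }
pairs_occurring_twice_or_more = counts.select { |pair, count| count > 1 }
#=> {"my name"=>2, "name is"=>2, "from London"=>3}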
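
The same pair-counting idea can probably be stretched to phrases of any fixed length by passing a different size to `each_cons`. The sketch below is one way to do that; `phrase_counts` is a hypothetical helper name, and it assumes case should be preserved and that words are never joined across two strings.

```ruby
# Hypothetical helper: count n-word phrases that occur more than once.
def phrase_counts(strings, n)
  strings
    .flat_map { |s| s.scan(/[\w']+/).each_cons(n).map { |run| run.join(' ') } }
    .each_with_object(Hash.new(0)) { |phrase, counts| counts[phrase] += 1 }
    .select { |_phrase, count| count > 1 }
end

strings = ["hello, my name is Emily, I'm from London",
           "this chocolate from London is really good",
           "my name is James, what did you say yours was",
           "is he from London?"]

pairs   = phrase_counts(strings, 2) # "my name" 2, "name is" 2, "from London" 3
triples = phrase_counts(strings, 3) # "my name is" 2
```

Unlike the brute-force version, this does not remove shorter phrases that are part of a longer repeated one, so "my name" and "name is" still show up next to "my name is".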

