
Can I find frequently occurring phrases in an array of strings where the phrase only forms part of each string?

This has been asked before but was never answered.

I want to search through an array of strings and find the most frequent phrases (2 or more words) that occur within those strings. So given:

["hello, my name is Emily, I'm from London", 
"this chocolate from London is really good", 
"my name is James, what did you say yours was", 
"is he from London?"]

I would want to get back something along the lines of:

{"from London" => 3, "my name is" => 2 }

I don't really know how to approach this. Any suggestions would be awesome, even if it's just a strategy that I could test out.

This isn't a one-stage process, but it is possible. Ruby knows what a character is, what a digit is, what a string is, etc., but it doesn't know what a phrase is.

You'd need to:

  1. Begin by either building a list of phrases or finding one online. This list then forms the basis of the matching process.

  2. For each string, iterate through the list of phrases to see whether any phrase from the list occurs within that string.

  3. Record a count of every occurrence of each phrase within the strings.
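The three steps above can be sketched roughly as follows. This is only a minimal illustration, assuming you already have a list of candidate phrases: the `PHRASES` list and the `count_known_phrases` helper are hypothetical names, not part of any library. Matching here is case-insensitive and by plain substring, so "from london" would also match inside "from londonderry".

```ruby
# Step 1: a hypothetical phrase list; in practice you would build or download one.
PHRASES = ["from london", "my name is"].freeze

# Steps 2 and 3: check every string for every phrase and tally the occurrences.
def count_known_phrases(strings, phrases)
  counts = Hash.new(0)
  strings.each do |string|
    text = string.downcase
    phrases.each do |phrase|
      # String#scan returns every non-overlapping occurrence of the phrase.
      hits = text.scan(phrase).size
      counts[phrase] += hits if hits > 0
    end
  end
  counts
end

strings = ["hello, my name is Emily, I'm from London",
           "this chocolate from London is really good",
           "my name is James, what did you say yours was",
           "is he from London?"]

p count_known_phrases(strings, PHRASES) #=> {"from london"=>3, "my name is"=>2}
```

The obvious limitation is that this only finds phrases you already thought to put in the list.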

Although it might not seem it, this is quite a high-level question, so try to break the task down into smaller tasks.

Here is something that might get you started. This is brute forcing and will be very, very slow for large data sets.

x = ["hello, my name is Emily, I'm from London", 
"this chocolate from London is really good", 
"my name is James, what did you say yours was", 
"is he from London?"]

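# Every run of 2 or more consecutive words in each line, lowercased.
word_maps = x.flat_map do |line|
  words = line.downcase.scan(/\w+/)
  (2..words.size).flat_map { |n| words.each_cons(n).map { |run| run.join(' ') } }
end

# Count each phrase and keep only those seen more than once.
word_maps_hash = Hash[word_maps.group_by { |phrase| phrase }
                               .reject { |_, hits| hits.size == 1 }
                               .map { |phrase, hits| [phrase, hits.size] }]

# Drop any phrase that is contained in a longer repeated phrase.
original_hash_keys = word_maps_hash.keys
word_maps_hash.delete_if { |key, _| original_hash_keys.any? { |ohk| ohk[key] && ohk != key } }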

p word_maps_hash #=> {"from london"=>3, "my name is"=>2}
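Because this enumerates every run of words in every string, the work grows quickly with line length. One possible way to limit it, assuming phrases longer than a few words are of no interest, is to cap the phrase length. The sketch below is a variation on the idea rather than the code above: `frequent_phrases`, `max_words` and `min_count` are made-up names, and `Enumerable#tally` needs Ruby 2.7 or later. It also compares phrases on whole words, so a short phrase is only dropped when it is a word-level part of a longer repeated phrase.

```ruby
# Count phrases of 2..max_words words that appear at least min_count times,
# then drop phrases that are wholly contained in a longer surviving phrase.
def frequent_phrases(strings, max_words: 5, min_count: 2)
  phrases = strings.flat_map do |string|
    words = string.downcase.scan(/\w+/)
    upper = [max_words, words.size].min
    (2..upper).flat_map { |n| words.each_cons(n).map { |run| run.join(' ') } }
  end

  frequent = phrases.tally.select { |_, count| count >= min_count }

  # Pad with spaces so that containment is tested on whole words only.
  frequent.reject do |phrase, _|
    frequent.keys.any? do |other|
      other != phrase && " #{other} ".include?(" #{phrase} ")
    end
  end
end

strings = ["hello, my name is Emily, I'm from London",
           "this chocolate from London is really good",
           "my name is James, what did you say yours was",
           "is he from London?"]

p frequent_phrases(strings) #=> {"from london"=>3, "my name is"=>2}
```

As in the code above, a shorter phrase is dropped even when it occurs more often than the longer phrase that contains it, which may or may not be what you want.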

How about:

x = ["hello, my name is Emily, I'm from London", 
"this chocolate from London is really good", 
"my name is James, what did you say yours was", 
"is he from London?"]

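# Split each string into words (keeping apostrophes) and pair up adjacent words.
# Pairs are built per string so that no pair spans two different strings.
word_pairs = x.flat_map do |phrase|
  phrase.split(/[^\w']+/).reject(&:empty?).each_cons(2).map { |pair| pair.join(' ') }
end
counts = Hash.new(0)
word_pairs.each { |pair| counts[pair] += 1 }
# Note: this only finds two-word phrases, and matching is case-sensitive.
pairs_occurring_twice_or_more = counts.select { |_pair, count| count > 1 }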
