Can I find frequently occurring phrases in an array of strings where the phrase only forms part of each string?
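This has been asked before, but it was never answered.
I'd like to search an array of strings and find the phrases (2 or more words) that occur most frequently across those strings, so given: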
["hello, my name is Emily, I'm from London",
"this chocolate from London is really good",
"my name is James, what did you say yours was",
"is he from London?"]
I'd like to get back something like:
{"from London" => 3, "my name is" => 2 }
I really don't know how to approach this. Any suggestions would be great, even if it's just a strategy that I could test.
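This isn't a one-stage process, but it is possible. Ruby knows what a character is, what a number is, what a string is, and so on, but it doesn't know what a phrase is.
You would need to:
Start by building a list of phrases, or find a list online. This then forms the basis of the matching process.
For each string, iterate through the list of phrases to see whether any phrase in the list occurs within that string.
Record a count for each occurrence of a phrase within a string.
Although it may not look like it, this is a fairly high-level problem, so try to break the task down into smaller tasks.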
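The steps above could be sketched roughly like this. The phrase list is hypothetical and hand-picked for illustration, and the method name `count_known_phrases` is made up here; treat it as one possible reading of the strategy rather than a finished solution.

```ruby
# Hypothetical, hand-picked phrase list (step 1); a real list would come from elsewhere.
KNOWN_PHRASES = ["from london", "my name is", "thank you"].freeze

# Count how often each known phrase occurs across all strings (steps 2 and 3).
def count_known_phrases(strings, phrases)
  counts = Hash.new(0)
  strings.each do |string|
    # Normalize to lowercase words joined by single spaces so that
    # punctuation and case do not prevent a match.
    normalized = string.downcase.scan(/\w+/).join(' ')
    phrases.each do |phrase|
      # \b keeps a phrase from matching in the middle of a longer word.
      hits = normalized.scan(/\b#{Regexp.escape(phrase)}\b/).size
      counts[phrase] += hits if hits > 0
    end
  end
  counts
end

strings = ["hello, my name is Emily, I'm from London",
           "this chocolate from London is really good",
           "my name is James, what did you say yours was",
           "is he from London?"]
p count_known_phrases(strings, KNOWN_PHRASES)
```

Phrases that never occur are simply left out of the result. The obvious limitation is that this only finds phrases you already thought to put in the list.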
This might get you started. It is brute force and will be very slow for large data sets.
x = ["hello, my name is Emily, I'm from London",
"this chocolate from London is really good",
"my name is James, what did you say yours was",
"is he from London?"]
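word_maps = x.flat_map do |line|
  # Lowercase and split into words (note that "I'm" becomes "i" and "m").
  words = line.downcase.scan(/\w+/)
  # Every run of 2 or more consecutive words is a candidate phrase.
  (2..words.size).flat_map { |n| words.each_cons(n).map { |p| p.join(' ') } }
end
# Count each phrase and keep only those that occur more than once.
word_maps_hash = Hash[word_maps.group_by { |w| w }.reject { |_, v| v.size == 1 }.map { |k, v| [k, v.size] }]
original_hash_keys = word_maps_hash.keys
# Drop any phrase that is contained in a longer repeated phrase.
word_maps_hash.delete_if { |key, _| original_hash_keys.any? { |ohk| ohk[key] && ohk != key } }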
p word_maps_hash #=> {"from london"=>3, "my name is"=>2}
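If the slowness is a concern, one possible mitigation (my assumption, not part of the answer above) is to cap the phrase length, so each string yields a number of candidates that is linear in its word count instead of quadratic. The sketch below uses a hypothetical `count_phrases` method with an assumed `max_words` parameter; any repeated phrase longer than the cap would be reported only as its shorter, overlapping pieces.

```ruby
# Count phrases of min_words..max_words consecutive words.
# max_words is an assumed cap introduced for this sketch.
def count_phrases(strings, min_words: 2, max_words: 4)
  counts = Hash.new(0)
  strings.each do |string|
    words = string.downcase.scan(/\w+/)
    # The range is empty when the string has fewer than min_words words.
    (min_words..[max_words, words.size].min).each do |n|
      words.each_cons(n) { |run| counts[run.join(' ')] += 1 }
    end
  end
  repeated = counts.select { |_, count| count > 1 }
  # Drop phrases contained in a longer repeated phrase. Padding with
  # spaces makes the containment check respect word boundaries.
  repeated.reject do |phrase, _|
    repeated.keys.any? { |other| other != phrase && " #{other} ".include?(" #{phrase} ") }
  end
end

x = ["hello, my name is Emily, I'm from London",
     "this chocolate from London is really good",
     "my name is James, what did you say yours was",
     "is he from London?"]
p count_phrases(x)
```

Counting with a `Hash.new(0)` while iterating also avoids building the full array of candidate phrases before grouping it.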
How about this? Note that it only finds two-word phrases and is case-sensitive:
x = ["hello, my name is Emily, I'm from London",
"this chocolate from London is really good",
"my name is James, what did you say yours was",
"is he from London?"]
# Build the pairs within each string so that no pair spans two different strings.
word_pairs = x.flat_map do |phrase|
  phrase.split(/[^\w']+/).each_cons(2).map { |pair| pair.join(' ') }
end
counts = Hash.new 0
word_pairs.each {|pair| counts[pair] += 1}
pairs_occuring_twice_or_more = counts.select {|pair, count| count > 1}