Can I find frequently occurring phrases in an array of strings where the phrase only forms part of each string?
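This has been asked before, but it never received an answer.
I want to search an array of strings and find the phrases (two or more words) that occur most frequently across those strings. So given: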
["hello, my name is Emily, I'm from London",
"this chocolate from London is really good",
"my name is James, what did you say yours was",
"is he from London?"]
I'd like to get back something like:
{"from London" => 3, "my name is" => 2 }
I really don't know how to approach this. Any suggestions would be great, even if it's just a strategy I can test.
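This isn't a one-step process, but it is possible. Ruby knows what a character is, what a number is, what a string is, and so on, but it has no notion of what a phrase is.
You need to:
Start by building a list of phrases, or find one online. This list then forms the basis of the matching process.
Loop over the phrase list for each string, checking whether any phrase in the list occurs within that string.
Keep a count of every instance of a phrase found in a string.
Although it may not look like it, this is a fairly high-level problem, so try to break the task down into smaller tasks.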
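The steps above could be sketched roughly as follows. This is only a minimal illustration: the candidate list and the method name `count_known_phrases` are made up for the example, and it assumes matching should be case-insensitive and respect word boundaries.

```ruby
# Hypothetical candidate phrases; in practice these would come from your own
# list or from an external source.
candidate_phrases = ["from london", "my name is", "really good"]

strings = ["hello, my name is Emily, I'm from London",
           "this chocolate from London is really good",
           "my name is James, what did you say yours was",
           "is he from London?"]

# Count every instance of each candidate phrase across all strings.
def count_known_phrases(strings, phrases)
  counts = Hash.new(0)
  strings.each do |string|
    normalized = string.downcase
    phrases.each do |phrase|
      # \b stops "is he" from matching inside "this he", for example.
      hits = normalized.scan(/\b#{Regexp.escape(phrase)}\b/).size
      counts[phrase] += hits if hits > 0
    end
  end
  counts
end

known_counts = count_known_phrases(strings, candidate_phrases)
p known_counts
```

The obvious limitation is that only phrases already in the list can ever be found, so this suits cases where the interesting phrases are known up front.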
This might get you started. It is brute force and will be very slow on large data sets.
x = ["hello, my name is Emily, I'm from London",
"this chocolate from London is really good",
"my name is James, what did you say yours was",
"is he from London?"]
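# Build every run of 2 or more consecutive words (lowercased) from each line.
word_maps = x.flat_map do |line|
  words = line.downcase.scan(/\w+/)
  (2..words.size).flat_map { |n| words.each_cons(n).map { |phrase| phrase.join(' ') } }
end
# Count each phrase and keep only those that occur more than once.
word_maps_hash = Hash[word_maps.group_by { |phrase| phrase }.reject { |_, hits| hits.size == 1 }.map { |phrase, hits| [phrase, hits.size] }]
# Drop any phrase that is a substring of a longer surviving phrase.
original_hash_keys = word_maps_hash.keys
word_maps_hash.delete_if { |key, _| original_hash_keys.any? { |ohk| ohk[key] && ohk != key } }
p word_maps_hash #=> {"from london"=>3, "my name is"=>2}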
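For what it's worth, the same brute-force idea can be written a little more compactly on Ruby 2.7 or later, where `Enumerable#tally` is available. The method name `frequent_phrases` and its keyword arguments are invented for this sketch. It also compares phrases on word boundaries when pruning, so that a phrase such as "is he" would not be dropped merely because it is a character-level substring of something like "this he".

```ruby
# A sketch only: collects all n-grams of at least min_words words, keeps those
# seen at least min_count times, then drops phrases contained in longer ones.
def frequent_phrases(lines, min_words: 2, min_count: 2)
  ngrams = lines.flat_map do |line|
    words = line.downcase.scan(/\w+/)
    (min_words..words.size).flat_map do |n|
      words.each_cons(n).map { |gram| gram.join(' ') }
    end
  end

  counts = ngrams.tally.select { |_, count| count >= min_count }

  # Padding with spaces makes the containment check word-aligned.
  counts.reject do |phrase, _|
    counts.keys.any? { |other| other != phrase && " #{other} ".include?(" #{phrase} ") }
  end
end

lines = ["hello, my name is Emily, I'm from London",
         "this chocolate from London is really good",
         "my name is James, what did you say yours was",
         "is he from London?"]

phrase_counts = frequent_phrases(lines)
p phrase_counts
```

Like the original, this still generates every n-gram of every line, so it remains quadratic in line length and is not meant for large inputs.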
How about:
x = ["hello, my name is Emily, I'm from London",
"this chocolate from London is really good",
"my name is James, what did you say yours was",
"is he from London?"]
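# Split each string into words (keeping apostrophes) and pair up adjacent words,
# one string at a time so that no pair spans two different strings.
word_pairs = x.flat_map do |phrase|
  phrase.split(/[^\w']+/).each_cons(2).map { |pair| pair.join(' ') }
end
counts = Hash.new(0)
word_pairs.each { |pair| counts[pair] += 1 }
pairs_occurring_twice_or_more = counts.select { |pair, count| count > 1 }
p pairs_occurring_twice_or_more #=> {"my name"=>2, "name is"=>2, "from London"=>3}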
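Since the question asks for the most frequent phrases, it may help to rank the pairs rather than just filter them. The following is a small self-contained sketch of that, assuming Ruby 2.7+ for `tally`; the variable names are made up for the example, and pairs are built per string so that none spans two strings.

```ruby
lines = ["hello, my name is Emily, I'm from London",
         "this chocolate from London is really good",
         "my name is James, what did you say yours was",
         "is he from London?"]

# Count adjacent word pairs within each string (case is preserved here).
pair_counts = lines
  .flat_map { |line| line.split(/[^\w']+/).reject(&:empty?).each_cons(2).map { |pair| pair.join(' ') } }
  .tally

# Keep repeated pairs, most frequent first; ties are ordered alphabetically.
ranked_pairs = pair_counts
  .select { |_, count| count > 1 }
  .sort_by { |pair, count| [-count, pair] }

ranked_pairs.each { |pair, count| puts "#{count}  #{pair}" }
```

Note that this only ever finds two-word phrases, so "my name is" shows up as the two overlapping pairs "my name" and "name is".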