How do I detect the presence of a substring of one string in another string?

Say I have a string "rubinassociatespa". What I would like to do is detect, in any other string, any substring of that string that is 3 characters or longer.

For example, the following strings should be detected:

  • rubin
  • associates
  • spa
  • ass
  • rub, etc.

But the following strings should NOT be detected:

  • rob
  • cpa
  • dea
  • ru, or any other string that does not appear in my original string or is shorter than 3 characters.

Basically, I have a string and I am comparing many other strings against it, and I only want to match the strings that are substrings of the original string.

I hope that's clear.

str = "rubinassociatespa"

arr = %w| rubin associates spa ass rub rob cpa dea ru |
  #=> ["rubin", "associates", "spa", "ass", "rub", "rob", "cpa", "dea", "ru"]

Just use String#include?, together with a length check:

def substring?(str, s)
  # true only when s has at least 3 characters and occurs inside str
  s.size >= 3 && str.include?(s)
end

arr.each { |s| puts "#{s}: #{substring? str, s}" }
  # rubin: true
  # associates: true
  # spa: true
  # ass: true
  # rub: true
  # rob: false
  # cpa: false
  # dea: false
  # ru: false
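
Since the question is about comparing many strings against one original, the same check can be used to filter the whole array at once. The following is only a sketch: the helper name matching_substrings and the MIN_LENGTH constant are made up for illustration and are not part of the answer above.

```ruby
# Hypothetical constant: mirrors the 3-character minimum from the question.
MIN_LENGTH = 3

# Hypothetical helper: keeps only the candidates that are long enough
# and occur somewhere inside str.
def matching_substrings(str, candidates, min_length = MIN_LENGTH)
  candidates.select { |s| s.size >= min_length && str.include?(s) }
end

str = "rubinassociatespa"
arr = %w[rubin associates spa ass rub rob cpa dea ru]

matches = matching_substrings(str, arr)
p matches
  #=> ["rubin", "associates", "spa", "ass", "rub"]
```

Array#select preserves the order of the input, so the matches come back in the same order as arr.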

You can use match:

str = "rubinassociatespa"

test_str = "associates"

str.match(test_str) #=> #<MatchData "associates">
str.match(test_str).to_s #=> "associates"

test_str = 'rob'

str.match(test_str) #=> nil

So, if test_str is a substring of str, the match method returns a MatchData object whose matched text is the entire test_str; otherwise, it returns nil. Combine it with a length check:

if test_str.length >= 3 && str.match(test_str)
  # do stuff here. 
end
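
One caveat with this approach: String#match converts a string argument into a Regexp without escaping it, so a candidate containing regex metacharacters is treated as a pattern rather than as literal text. If the candidate strings might contain such characters, wrapping them in Regexp.escape is one way to keep the comparison literal. A small sketch, where the candidate "r.b" is a made-up example:

```ruby
str = "rubinassociatespa"

# Made-up candidate containing a regex metacharacter ("." matches any character).
test_str = "r.b"

unescaped = str.match(test_str)                 # compiled as /r.b/, matches "rub"
escaped   = str.match(Regexp.escape(test_str))  # compiled as /r\.b/, no literal "r.b" in str

puts unescaped ? "unescaped: matched #{unescaped[0]}" : "unescaped: no match"
puts escaped ? "escaped: matched #{escaped[0]}" : "escaped: no match"
```

String#include? does a plain substring comparison, so it does not have this problem.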

First, you need a list of acceptable strings. Something like https://github.com/first20hours/google-10000-english would probably be useful.

Second, you want a data structure that allows fast lookups to check whether a word is valid. I would use a Bloom filter for this. This gem might be useful if you don't want to implement it on your own: https://github.com/igrigorik/bloomfilter-rb

Then you need to populate the Bloom filter with every word in the valid word list.

Then, for each substring of your string, do a lookup in the Bloom filter to see whether it is in the valid word list. See this example for how to get all substrings: What is the best way to split a string to get all the substrings by Ruby?

If the Bloom filter returns true, you need to do a secondary check to confirm that the word is actually in the list, since a Bloom filter is a probabilistic data structure and can return false positives. You will probably want to store the valid word list in a database, so you can just do a database lookup to confirm whether the word is valid.
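
The overall flow can be sketched roughly as follows. Note the assumptions: this uses Ruby's standard Set as a stand-in for the Bloom filter (a Set is exact, so no secondary check is needed), the word list is a tiny made-up sample rather than the linked list, and the names VALID_WORDS and substrings are hypothetical.

```ruby
require "set"

# Made-up sample word list; a real one would be loaded from a file or database.
VALID_WORDS = Set.new(%w[rub rubin ass associate associates spa])

# Hypothetical helper: every contiguous substring of str that has
# at least min_length characters, ordered by start position, then length.
def substrings(str, min_length = 3)
  (0..str.size - min_length).flat_map do |start|
    (min_length..str.size - start).map { |len| str[start, len] }
  end
end

# Keep only the substrings that appear in the word list.
found = substrings("rubinassociatespa").select { |s| VALID_WORDS.include?(s) }.uniq
p found
  #=> ["rub", "rubin", "ass", "associate", "associates", "spa"]
```

A string of length n has on the order of n² substrings, so this is fine for short strings like the one in the question but gets expensive for long inputs.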

I hope this gives you an idea of how to proceed.
