
Fastest way to compare two strings with dynamic white space?

I have two strings, bigstring and smallstring, and each string is a paragraph of words. However, between each pair of words is a run of whitespace characters (\s in regex) of random length.

So for example bigstring could be like "hello      world". The same goes for smallstring.

What I want to do is check whether smallstring is a substring of bigstring (word for word), where every \s+ run is considered the same, and case-insensitively. So for example, if

bigstring = "hello \t\r\n world \n foobar"

smallstring = "HELLO \t world"

then smallstring is a substring of bigstring.

bigstring = "hello \t\r\n world \n foobar"

smallstring = "HEL"

This is not a substring (word for word), because there is no word hel in bigstring.

bigstring = "the \t\r\n nest"

smallstring = "then \n est"

This is also not a substring (word for word).

One method is to tokenize both strings into arrays: break up the text between \s+ runs into tokens, with \s+ as the delimiter. Then check whether one array is contained in the other, in order and consecutively, comparing case-insensitively.
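The tokenizing approach described above can be sketched roughly as follows. The names tokenize and containsWords are made up for illustration, and treating an empty smallstring as a match is an assumption, since the question does not say what should happen in that case.

```javascript
// Hypothetical helper: split on whitespace runs and lowercase every word.
function tokenize(text) {
  const trimmed = text.trim();
  return trimmed === "" ? [] : trimmed.toLowerCase().split(/\s+/);
}

// Returns true if the words of smallstring occur in bigstring
// consecutively and in the same order.
function containsWords(bigstring, smallstring) {
  const big = tokenize(bigstring);
  const small = tokenize(smallstring);
  if (small.length === 0) return true; // assumption: an empty needle always matches
  for (let start = 0; start + small.length <= big.length; start++) {
    let matched = 0;
    while (matched < small.length && big[start + matched] === small[matched]) {
      matched++;
    }
    if (matched === small.length) return true;
  }
  return false;
}

console.log(containsWords("hello \t\r\n world \n foobar", "HELLO \t world")); // true
console.log(containsWords("hello \t\r\n world \n foobar", "HEL"));            // false
console.log(containsWords("the \t\r\n nest", "then \n est"));                 // false
```

This allocates two arrays, so it is probably not the fastest option, but it is a handy reference to test faster versions against.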

However, in this case speed is the priority, so I need the fastest way.

Does anyone know a way to check this?

I was thinking of checking the strings by looping through both, character by character, but I am not sure how to do that.
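One possible shape for such a character-by-character loop is sketched below. It is not taken from any of the answers; the function names are hypothetical, isSpace only recognizes ASCII whitespace (an assumption, since \s in a regex also covers some Unicode spaces), and an empty smallstring is assumed to match.

```javascript
// Assumption: only ASCII whitespace counts as a separator.
const isSpace = (ch) =>
  ch === " " || ch === "\t" || ch === "\n" || ch === "\r" || ch === "\f" || ch === "\v";

// Advance pos past any whitespace and return the new position.
function skipSpaces(text, pos) {
  while (pos < text.length && isSpace(text[pos])) pos++;
  return pos;
}

// Try to match all of small against big, starting at word start i in big.
function matchesAt(big, i, small) {
  let j = skipSpaces(small, 0);
  while (j < small.length) {
    if (isSpace(small[j])) {
      j = skipSpaces(small, j);
      if (j === small.length) break; // trailing whitespace in small
      // A whitespace run in small must line up with one in big.
      if (i >= big.length || !isSpace(big[i])) return false;
      i = skipSpaces(big, i);
    } else {
      if (i >= big.length || big[i].toLowerCase() !== small[j].toLowerCase()) return false;
      i++;
      j++;
    }
  }
  // The match must end on a word boundary in big.
  return i === big.length || isSpace(big[i]);
}

function isWordSubstring(big, small) {
  if (skipSpaces(small, 0) === small.length) return true; // assumption: empty needle matches
  for (let i = 0; i < big.length; i++) {
    const atWordStart = !isSpace(big[i]) && (i === 0 || isSpace(big[i - 1]));
    if (atWordStart && matchesAt(big, i, small)) return true;
  }
  return false;
}

console.log(isWordSubstring("hello \t\r\n world \n foobar", "HELLO \t world")); // true
console.log(isWordSubstring("hello \t\r\n world \n foobar", "HEL"));            // false
console.log(isWordSubstring("the \t\r\n nest", "then \n est"));                 // false
```

This avoids building normalized copies of either string, but a restart at every word start makes the worst case quadratic, so it would need benchmarking against the other approaches before calling it the fastest.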

Thanks

I am not sure where this ranks on speed, but does this achieve your goal? (Now edited to handle the 'impl' vs. 'mpl' edge case by adding a leading space.)

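var isSubstring = function(bigstring, smallstring) {
  // Collapse each whitespace run to one space, lowercase, and pad with a space
  // on both sides so a match must start and end on a word boundary.
  // trim() stops leading/trailing whitespace in the input from doubling the padding.
  bigstring = " " + bigstring.trim().replace(/\s+/g, " ").toLowerCase() + " ";
  smallstring = " " + smallstring.trim().replace(/\s+/g, " ").toLowerCase() + " ";
  return bigstring.indexOf(smallstring) >= 0;
}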

Adding a trailing (and now a leading) space covers the case where smallstring is a single word fragment ('hel' vs. 'hello' in your example above, and 'impl' vs. 'mpl' in the comments below).

Use cases:

bigstring = "hello   \t\r\n  world \n foobar"
smallstring = "HELLO \t world"
console.log(isSubstring(bigstring, smallstring))
//evaluates to true

bigstring = "hello   \t\r\n  world \n foobar"
smallstring = "HEL"
console.log(isSubstring(bigstring, smallstring))
// evaluates to false

bigstring = "impl"
smallstring = "mpl"
console.log(isSubstring(bigstring, smallstring))
// evaluates to false

RegExp is definitely not the fastest, but you can search the big string with a RegExp generated from the small string:

bigstring = "hello \t\r\n world \n foobar"
smallstring = "HELLO \t world"
r = new RegExp('\\b' + smallstring.replace(/\s+/g, '\\s+') + '\\b', 'i')
console.log(r.test(bigstring), r) // true /\bHELLO\s+world\b/i
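One caveat with generating a RegExp from smallstring is that characters such as . or + in the words would be interpreted as regex syntax, and \b only behaves as a word boundary next to letters, digits, and underscores. A hedged variant that escapes each word and uses whitespace (or the string ends) as the boundary might look like this; escapeRegExp and buildWordRegExp are hypothetical names, not part of any library.

```javascript
// Hypothetical helper: escape characters that are special in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a case-insensitive RegExp that matches the words of smallstring
// separated by any whitespace runs, bounded by whitespace or the string ends.
function buildWordRegExp(smallstring) {
  const words = smallstring.trim().split(/\s+/).map(escapeRegExp);
  return new RegExp("(?:^|\\s)" + words.join("\\s+") + "(?=\\s|$)", "i");
}

console.log(buildWordRegExp("HELLO \t world").test("hello \t\r\n world \n foobar")); // true
console.log(buildWordRegExp("HEL").test("hello \t\r\n world \n foobar"));            // false
console.log(buildWordRegExp("a.c").test("abc"));                                     // false
console.log(buildWordRegExp("c++ (beta)").test("new  C++   (beta) build"));          // true
```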

A faster case-insensitive string search would most likely use charCodeAt and/or some kind of word/token lookup structure, as https://github.com/bvaughn/js-search seems to do, for example.

Let F(a) return a unified version of string a. By unified I mean that all consecutive whitespace characters are replaced by a single space and all letters are converted to lower case. This function can be computed in linear time, O(|a|).

In this case you need to check whether F(smallstring) is a substring of F(bigstring). To do this quickly you can use a standard algorithm such as KMP.
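A minimal sketch of this idea follows, with unify playing the role of F. Note one addition that is not in the description above: unify also trims the input and pads it with a space on each side (borrowed from the earlier answer), because a plain substring test on F alone would accept 'hel' inside 'hello'. All function names here are illustrative.

```javascript
// F(a): collapse whitespace runs, lowercase, and (an addition, see above)
// pad with one space on each side so matches fall on word boundaries.
function unify(text) {
  return " " + text.trim().replace(/\s+/g, " ").toLowerCase() + " ";
}

// KMP failure table: table[i] is the length of the longest proper prefix
// of pattern[0..i] that is also a suffix of it.
function buildFailureTable(pattern) {
  const table = new Array(pattern.length).fill(0);
  let k = 0;
  for (let i = 1; i < pattern.length; i++) {
    while (k > 0 && pattern[i] !== pattern[k]) k = table[k - 1];
    if (pattern[i] === pattern[k]) k++;
    table[i] = k;
  }
  return table;
}

// Returns the index of the first occurrence of pattern in text, or -1.
function kmpIndexOf(text, pattern) {
  if (pattern.length === 0) return 0;
  const table = buildFailureTable(pattern);
  let k = 0;
  for (let i = 0; i < text.length; i++) {
    while (k > 0 && text[i] !== pattern[k]) k = table[k - 1];
    if (text[i] === pattern[k]) k++;
    if (k === pattern.length) return i - k + 1;
  }
  return -1;
}

function isSubstringKmp(bigstring, smallstring) {
  return kmpIndexOf(unify(bigstring), unify(smallstring)) !== -1;
}

console.log(isSubstringKmp("hello \t\r\n world \n foobar", "HELLO \t world")); // true
console.log(isSubstringKmp("hello \t\r\n world \n foobar", "HEL"));            // false
console.log(isSubstringKmp("the \t\r\n nest", "then \n est"));                 // false
```

KMP guarantees linear time in the combined length of both strings. Whether a hand-written version actually beats the engine's built-in indexOf is something to measure rather than assume.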
