Fastest way to compare two strings with dynamic white space?
I have two strings, bigstring and smallstring, and each string is a paragraph of words. However, between each word is a run of whitespace characters (\s in regex) of random length.
So, for example, bigstring could be something like hello world, with any amount of whitespace between the words. The same goes for smallstring.
What I want to do is check whether smallstring is a substring of bigstring (word for word), where any \s+ run is treated as equivalent, and case-insensitively. So, for example, if
bigstring = "hello \t\r\n world \n foobar"
smallstring = "HELLO \t world"
then smallstring is a substring of bigstring.
bigstring = "hello \t\r\n world \n foobar"
smallstring = "HEL"
This is not a substring (word for word), because there is no word hel in bigstring.
bigstring = "the \t\r\n nest"
smallstring = "then \n est"
This is also not a substring (word for word).
One method is to tokenize both strings into arrays: the text between \s+ runs becomes the tokens, and the \s+ runs are the delimiters. Then check, case-insensitively, whether one array is contained in the other, in order and consecutively.
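The tokenizing approach described above could look roughly like the sketch below. The names tokenize and containsTokens are hypothetical, and treating an empty query as a match is an assumption rather than something the question specifies.

```javascript
// Hypothetical helper: lowercase the text and split it on whitespace runs.
function tokenize(text) {
  const trimmed = text.trim();
  return trimmed === "" ? [] : trimmed.toLowerCase().split(/\s+/);
}

// True when the tokens of `small` occur consecutively and in order in `big`.
function containsTokens(big, small) {
  const hay = tokenize(big);
  const needle = tokenize(small);
  if (needle.length === 0) return true; // assumption: an empty query matches
  for (let start = 0; start + needle.length <= hay.length; start++) {
    let k = 0;
    while (k < needle.length && hay[start + k] === needle[k]) k++;
    if (k === needle.length) return true;
  }
  return false;
}

console.log(containsTokens("hello \t\r\n world \n foobar", "HELLO \t world")); // true
console.log(containsTokens("hello \t\r\n world \n foobar", "HEL"));            // false
console.log(containsTokens("the \t\r\n nest", "then \n est"));                 // false
```

This is easy to reason about, but it allocates two arrays plus lowercased copies of both strings, which is the overhead the question is trying to avoid.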
However, in this case speed is the priority, so I am looking for the fastest way.
Does anyone know a way to check this?
I was thinking of checking these strings by looping through both, character by character, but I am not sure how to do that.
Thanks
I am not sure where this ranks on speed, but does this achieve your goal? (Now edited for the edge case of 'impl' vs. 'mpl' by adding a leading space.)
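var isSubstring = function(bigstring, smallstring) {
    // Collapse each whitespace run to one space, lowercase, and pad with
    // spaces so that a match can only start and end on a word boundary.
    bigstring = " " + bigstring.replace(/\s+/g, " ").toLowerCase() + " ";
    smallstring = " " + smallstring.replace(/\s+/g, " ").toLowerCase() + " ";
    return bigstring.indexOf(smallstring) >= 0;
};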
Adding a trailing (and, now, leading) space covers the case where smallstring is a fragment of a single word ('hel' vs. 'hello' in your example above, and 'impl' vs. 'mpl' in the comments below).
Use cases:
bigstring = "hello \t\r\n world \n foobar"
smallstring = "HELLO \t world"
console.log(isSubstring(bigstring, smallstring))
//evaluates to true
bigstring = "hello \t\r\n world \n foobar"
smallstring = "HEL"
console.log(isSubstring(bigstring, smallstring))
// evaluates to false
bigstring = "impl"
smallstring = "mpl"
console.log(isSubstring(bigstring, smallstring))
// evaluates to false
RegExp is definitely not the fastest, but you can search the big string with a RegExp generated from the small string:
bigstring = "hello \t\r\n world \n foobar"
smallstring = "HELLO \t world"
r = new RegExp('\\b' + smallstring.replace(/\s+/g, '\\s+') + '\\b', 'i')
console.log(r.test(bigstring), r) // true /\bHELLO\s+world\b/i
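One caveat when a pattern is built directly from smallstring: any regex metacharacters in it (such as . or *) are interpreted as pattern syntax, and \b only marks a boundary next to a word character. The sketch below is one possible variation, not the answer's own code. It escapes each word first and swaps \b for whitespace lookarounds, which assumes an engine with lookbehind support (ES2018 or later). The helper names escapeRegExp and buildWordPattern are hypothetical.

```javascript
// Hypothetical helper: escape characters that have a special meaning in a RegExp.
function escapeRegExp(word) {
  return word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a case-insensitive pattern in which words are joined by \s+ and the
// match must be delimited by whitespace or the ends of the string.
function buildWordPattern(small) {
  const words = small.trim().split(/\s+/).map(escapeRegExp);
  return new RegExp("(?<!\\S)" + words.join("\\s+") + "(?!\\S)", "i");
}

const pattern = buildWordPattern("HELLO \t world");
console.log(pattern.test("hello \t\r\n world \n foobar")); // true
console.log(buildWordPattern("HEL").test("hello \t\r\n world \n foobar")); // false
```

Because the pattern has no g flag, calling test repeatedly on the same RegExp object does not carry state between calls.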
A faster case-insensitive string search would most likely use charCodeAt and/or some kind of word/token lookup structure, as https://github.com/bvaughn/js-search seems to do, for example.
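As a rough illustration of the charCodeAt idea, the sketch below walks both strings character by character without building intermediate strings or arrays. It is only a sketch under stated assumptions: whitespace is limited to the ASCII characters (space, \t, \n, \v, \f, \r), case folding is ASCII-only, and all function names are hypothetical. Whether it actually beats indexOf on normalized strings would need benchmarking on real data.

```javascript
// Assumption: only ASCII whitespace counts (space and char codes 9 to 13).
function isSpaceCode(c) {
  return c === 32 || (c >= 9 && c <= 13);
}

// Assumption: ASCII-only case folding (A-Z mapped to a-z).
function lowerCode(c) {
  return c >= 65 && c <= 90 ? c + 32 : c;
}

// Try to match every word of `small` against `big`, starting at word start `i`.
function matchesAt(big, i, small) {
  let j = 0;
  while (j < small.length && isSpaceCode(small.charCodeAt(j))) j++;
  while (j < small.length) {
    // Compare one word, character by character.
    while (j < small.length && !isSpaceCode(small.charCodeAt(j))) {
      if (i >= big.length) return false;
      if (lowerCode(big.charCodeAt(i)) !== lowerCode(small.charCodeAt(j))) return false;
      i++;
      j++;
    }
    // The word in `big` has to end where the word in `small` ended.
    if (i < big.length && !isSpaceCode(big.charCodeAt(i))) return false;
    // Skip the whitespace run in `small`, then in `big` if more words follow.
    while (j < small.length && isSpaceCode(small.charCodeAt(j))) j++;
    if (j < small.length) {
      while (i < big.length && isSpaceCode(big.charCodeAt(i))) i++;
      if (i >= big.length) return false;
    }
  }
  return true;
}

// Attempt a match at the start of every word in `big`.
function containsWordsByScan(big, small) {
  let i = 0;
  while (i < big.length) {
    while (i < big.length && isSpaceCode(big.charCodeAt(i))) i++;
    if (i >= big.length) break;
    if (matchesAt(big, i, small)) return true;
    while (i < big.length && !isSpaceCode(big.charCodeAt(i))) i++;
  }
  return false;
}

console.log(containsWordsByScan("hello \t\r\n world \n foobar", "HELLO \t world")); // true
console.log(containsWordsByScan("hello \t\r\n world \n foobar", "HEL"));            // false
```

The worst case is proportional to the number of words in bigstring times the length of smallstring, so this trades the linear-time guarantee of an algorithm like KMP for zero allocation.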
Let F(a) return a unified version of the string a. By unified I mean that every run of consecutive whitespace characters is replaced by a single space and all letters are converted to lower case. This function can be computed in linear time, O(|a|).
In this case you need to check whether F(smallstring) is a substring of F(bigstring). To do this quickly you can use a standard algorithm such as KMP.
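A minimal sketch of F combined with a textbook KMP search might look like this. The names unify, buildFailure, kmpIndexOf and isWordSubstring are hypothetical. Padding the unified strings with a space on each side is an addition borrowed from the indexOf answer above so that matches fall on word boundaries; it is not part of F as defined in this answer.

```javascript
// F(a): collapse whitespace runs to one space and lowercase.
// Assumption: trim and pad with single spaces so matches align to whole words.
function unify(a) {
  return " " + a.trim().replace(/\s+/g, " ").toLowerCase() + " ";
}

// KMP failure table: fail[i] is the length of the longest proper prefix of
// pattern[0..i] that is also a suffix of it.
function buildFailure(pattern) {
  const fail = new Array(pattern.length).fill(0);
  let k = 0;
  for (let i = 1; i < pattern.length; i++) {
    while (k > 0 && pattern[i] !== pattern[k]) k = fail[k - 1];
    if (pattern[i] === pattern[k]) k++;
    fail[i] = k;
  }
  return fail;
}

// Index of the first occurrence of `pattern` in `text`, or -1 if absent.
function kmpIndexOf(text, pattern) {
  if (pattern.length === 0) return 0;
  const fail = buildFailure(pattern);
  let k = 0;
  for (let i = 0; i < text.length; i++) {
    while (k > 0 && text[i] !== pattern[k]) k = fail[k - 1];
    if (text[i] === pattern[k]) k++;
    if (k === pattern.length) return i - k + 1;
  }
  return -1;
}

function isWordSubstring(big, small) {
  return kmpIndexOf(unify(big), unify(small)) !== -1;
}

console.log(isWordSubstring("hello \t\r\n world \n foobar", "HELLO \t world")); // true
console.log(isWordSubstring("hello \t\r\n world \n foobar", "HEL"));            // false
```

KMP guarantees time linear in the combined length of both strings, although in practice a built-in indexOf on the unified strings may well be faster, since engines implement it natively.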