Time Complexity Using Regex

I was working on a problem where, as one step, I had to count the number of words in a string.

E.g. "hi i am a programmer" should return 5. I thought of two ways to do this:

  1. Using split:
String[] words=messages[i].split("\\s");
int length=words.length;
  2. Using a loop:
int getwordCount(String message)
{
        int result = 1;

        //for(int i=0;i<message.length();i++){
        //  char ch=message.charAt(i);

        for (char ch : message.toCharArray())
        {
            if (ch == ' ') 
                ++result;
        }
        return result; 
}

In some cases the second method proved more efficient and gave me better timings. Which method is better to use, and why? The time complexity of .split() is O(n), which should be the same as the O(n) of the second method. Even when I include the cost of .toCharArray(), which is also O(n), the second method still gives better results.

The only explanation I can think of is the use of the regex \\s. What exactly is going on?

When you split the string on whitespace using a regex, the regex engine has to walk down the string once and make the splits. This option also requires allocating a new String array and populating it with the individual words. The array step increases both the running time and the storage requirements.

Your second version, while certainly more verbose, also requires only a single walk down the String. However, the loop version completely avoids allocating a String array and the time and space required to populate it. Therefore, I would expect the second version to outperform the first.

That being said, we don't necessarily use regex for its performance, but rather for the simplicity of the code it offers. I would probably always use the first (string split) version in a production code base, unless very high performance were absolutely required (e.g. in an Android app).

I'm by no means a code optimization expert, but I would assume the second method to be faster, especially on a large block of text, since the first needs to create N strings in memory while the second works directly on the string's char array. If you were to use a regex, I think the result would be faster if you used a pattern matcher for the \\s pattern and counted the matches in a while loop:

Pattern pattern = Pattern.compile("\\s");
Matcher matcher = pattern.matcher(yourString);
int count = 0;
while (matcher.find()) {
    count++;
}

First of all, both of your methods are wrong. The first method counts the number of space-separated "things" (which need not be words) rather than the number of non-space sequences:

jshell> ("some  words").split(" ")
$2 ==> String[3] { "some", "", "words" }

jshell> (" leading and trailing spacing ").split(" ")
$3 ==> String[5] { "", "leading", "and", "trailing", "spacing" }

jshell> ("").split(" ")
$4 ==> String[1] { "" }

jshell> (" ").split(" ")
$5 ==> String[0] {  }

As you can see, the word count does not match the count of space-delimited pieces. Counting the spaces suffers from the same issues, except that it will also fail on strings with "only" trailing spacing.
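To make the mismatch concrete, here is a minimal sketch that runs the question's two approaches next to a count of non-whitespace runs. The class and method names (CountComparison, bySplit, bySpaces, byRuns) are hypothetical and chosen only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CountComparison {
    // Approach 1 from the question: number of pieces produced by split.
    static int bySplit(String s) {
        return s.split("\\s").length;
    }

    // Approach 2 from the question: one plus the number of space characters.
    static int bySpaces(String s) {
        int result = 1;
        for (char ch : s.toCharArray()) {
            if (ch == ' ') {
                ++result;
            }
        }
        return result;
    }

    // Reference count: number of maximal runs of non-whitespace characters.
    static int byRuns(String s) {
        Matcher m = Pattern.compile("\\S+").matcher(s);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String[] samples = {
            "hi i am a programmer", "some  words", " leading and trailing spacing ", "", " "
        };
        for (String s : samples) {
            System.out.printf("\"%s\": split=%d, spaces=%d, runs=%d%n",
                    s, bySplit(s), bySpaces(s), byRuns(s));
        }
    }
}
```

All three agree on "hi i am a programmer", but for " leading and trailing spacing " they return 5, 6 and 4 respectively, and only the last is the number of words.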

Both the RegEx and the for-loop run in linear time O(n), since each has to visit every character exactly once. Using a RegEx requires compiling it first, but for such a simple RegEx we can reasonably neglect that cost. This does not mean that they take the same time to complete, though: the constant factors may very well differ. As others have already pointed out, RegEx + splitting obviously incurs significant overhead, especially since it requires O(n) auxiliary space, whereas the simple for-loop counting variant requires just constant O(1) auxiliary space to keep track of the count and the current index.

Fixing and improving your code using RegEx is quite easy:

int count = 0;
Matcher matcher = Pattern.compile("\\S+").matcher(message);
while (matcher.find()) count++;

You'll likely want to use a character class like \w (letters, digits and underscore) for words rather than looking for non-space characters with \S.

Or, if you want to implement this manually using a for-loop, which may be slightly faster:
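Which pattern is appropriate depends on what you count as a word. The sketch below, with hypothetical names (TokenPatterns, countNonSpaceRuns, countWordCharRuns), shows how the two classes diverge on punctuation; it also compiles each pattern once and reuses it, which is safe because Pattern instances are immutable.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TokenPatterns {
    // Compiled once and reused across calls.
    private static final Pattern NON_SPACE = Pattern.compile("\\S+");
    private static final Pattern WORD_CHARS = Pattern.compile("\\w+");

    // Counts non-overlapping matches of the given pattern in the text.
    private static int countMatches(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    // Runs of non-whitespace: punctuation stays attached to its word.
    static int countNonSpaceRuns(String text) {
        return countMatches(NON_SPACE, text);
    }

    // Runs of [a-zA-Z_0-9]: hyphens and apostrophes split a word apart.
    static int countWordCharRuns(String text) {
        return countMatches(WORD_CHARS, text);
    }
}
```

For the text "well-known, isn't it?", countNonSpaceRuns returns 3 while countWordCharRuns returns 5 (well, known, isn, t, it), so neither class is a drop-in definition of "word" for natural-language text.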

// Does the text start with a non-space character? An empty string has no words.
int count = (message.isEmpty() || message.charAt(0) == ' ') ? 0 : 1;
for (int i = 1; i < message.length(); i++) {
    // beginning of word: transition space -> non-space
    if (message.charAt(i) != ' ' && message.charAt(i - 1) == ' ')
        count++;
}
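The loop above treats only the space character ' ' as a separator. As a hedged variant, the sketch below tracks whether the scan is currently inside a word and uses Character.isWhitespace, so tabs and newlines also separate words; note that Character.isWhitespace covers more characters than the default \s class. The names LoopWordCounter and countWords are hypothetical.

```java
final class LoopWordCounter {
    // Counts maximal runs of non-whitespace characters in a single pass,
    // using O(1) extra space and no regex.
    static int countWords(String message) {
        int count = 0;
        boolean inWord = false;
        for (int i = 0; i < message.length(); i++) {
            boolean isSeparator = Character.isWhitespace(message.charAt(i));
            // A word starts at each transition from separator to non-separator.
            if (!isSeparator && !inWord) {
                count++;
            }
            inWord = !isSeparator;
        }
        return count;
    }
}
```

Because the state starts as "not in a word", empty and all-whitespace strings need no special case and simply yield 0.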

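Notice: technical posts on this site are licensed under CC BY-SA 4.0. If you repost this content, please credit this site's URL or the original source. For any questions, contact yoyou2525@163.com.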

 