Fastest way to detect if a string contains specific chars

I'm building an XML parser that goes over a big XML file, and I'm looking for the fastest way to detect whether a string contains any char other than " ", "\n" or "\r". I've tried using a regex, but it is too slow and heavy. Another method I tried was to get the number of " ", "\n" and "\r" chars and subtract it from the length of the String; if the length is larger, there's at least one other char. This operation is also heavy. Good advice would be appreciated.

Edit - Clarification:

"Too slow" means 300 milliseconds for one line of XML parsing plus string manipulation.

Examples of the two approaches I implemented:

By regex:

if (!str.matches(".*\\w.*")) {
    // str doesn't contain any (word) chars
}

By summing up ASCII values:

// +1 for the ending \r in str
if (numOfWhitespaces + numOfSpecialChars >= str.length()) {
    // str doesn't contain any other chars
}

The first solution (regex) is slower by 200 milliseconds. On a file with 500+ lines (where each line is processed independently), that difference is crucial.

I hope it's clear enough. Thanks!

The fastest way to scan a String is with a SAX listener:

public void characters(char ch[], int start, int length) throws SAXException {
    for(int i=start, end = start+ length; i < end; i++) {
       if(ch[i] <= ' ') {
          // check if it is a white space
       }
    }
}

If you are not using a SAX parser or another event-driven parser, that could be your performance bottleneck.
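To show how the `characters` callback above might be wired into a parser, here is a minimal sketch using the JDK's built-in SAX support. The class name `TextDetectingHandler`, the field `nonBlankChunks` and the helper `countNonBlankChunks` are hypothetical names chosen for illustration. It keeps the `<= ' '` test from the answer, so every char up to and including the space is treated as whitespace.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: counts text chunks that hold more than whitespace. */
final class TextDetectingHandler extends DefaultHandler {
    int nonBlankChunks = 0;

    @Override
    public void characters(char[] ch, int start, int length) {
        for (int i = start, end = start + length; i < end; i++) {
            if (ch[i] > ' ') {      // first char that is not whitespace or a control char
                nonBlankChunks++;
                return;             // early exit: no need to scan the rest of the chunk
            }
        }
    }

    /** Parses the given XML text and returns how many non-blank chunks were seen. */
    static int countNonBlankChunks(String xml) {
        TextDetectingHandler handler = new TextDetectingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return handler.nonBlankChunks;
    }

    public static void main(String[] args) {
        System.out.println(countNonBlankChunks("<a>\n  <b>hi</b>\n</a>"));
    }
}
```

A SAX parser is allowed to deliver one run of text in several `characters` calls, so the counter reflects chunks rather than elements. If a per-element answer is needed, the flag would have to be reset in `startElement` and read in `endElement`.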

"Too slow" means 300 milliseconds for one line of XML parsing plus the string manipulation. The whitespace check on the String is not tied to the XML parser, so please don't conflate the two. The origin of the String doesn't matter at all for this question.

The solution is:

By regex:

if (!str.matches(".*\\w.*")) {
    // str doesn't contain any (word) chars
}

By String methods:

// +1 for the ending \r in str
if (numOfWhitespaces + numOfSpecialChars >= str.length()) {
    // str doesn't contain any other chars
}

These lines should work on any String (the origin of the string doesn't matter). After two parallel runs, one with each solution, on a file with 50 lines (where each line contains a string that needs to be checked), the first solution (regex) was 200 milliseconds slower.

I hope it's clear enough. Thanks!
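Since the check is meant to work on any String regardless of where it came from, one option that avoids both the regex and the counting is a plain loop that stops at the first char other than " ", "\n" or "\r". This is only a sketch under that reading of the question; the class name `BlankCheck` and the method `hasContent` are hypothetical. Note that it is not equivalent to the regex version: `\w` only matches word characters, and `.` does not match line terminators by default, so `matches(".*\\w.*")` returns false for any string that contains "\n" or "\r".

```java
/** Hypothetical helper: detects any char other than ' ', '\n' or '\r'. */
final class BlankCheck {
    private BlankCheck() {}

    /** Returns true as soon as a char other than ' ', '\n' or '\r' is found. */
    static boolean hasContent(String s) {
        for (int i = 0, n = s.length(); i < n; i++) {
            char c = s.charAt(i);
            if (c != ' ' && c != '\n' && c != '\r') {
                return true;   // early exit on the first real char
            }
        }
        return false;          // empty, or only spaces and line breaks
    }

    public static void main(String[] args) {
        System.out.println(hasContent(" \r\n"));      // prints false
        System.out.println(hasContent("  <a/>\r"));   // prints true
    }
}
```

The loop touches each char at most once and allocates nothing, so its cost is bounded by the length of the line. Whether it removes the 200 millisecond gap would have to be measured on the actual file.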
