
Most efficient way to split a sentence

I am writing an application that relies heavily on splitting large strings into individual words. Because I have to process so many strings, I am concerned about efficiency. I am currently using String.split, but I do not know whether there is a more efficient way to do this.

private static String[] printWords(String input) {
    String[] splitWords = input.split(" ");
    return splitWords;
}

When I timed it a few years ago (on Java 6), String.split() was significantly slower than searching for individual space characters with indexOf(), because the former carries a lot of regex baggage.

If your sentences always split on a single space (somewhat dubious?) and performance is truly an issue (run some real tests), custom code would be faster.

Following the link provided in David Ehrmann's comment, it looks like Java 7 made some speedups. My tests were with Java 6.
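For illustration, the indexOf()-based custom code mentioned above might look roughly like the sketch below. The class and method names (SpaceSplitter, splitOnSpace) are made up for this example, and it assumes that empty tokens between consecutive spaces should be dropped, which differs from input.split(" "), where interior empty strings are kept. Whether it is actually faster on a current JVM is something to measure rather than assume.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original question or answers.
final class SpaceSplitter {

    // Splits on the space character using indexOf, with no regex machinery.
    // Assumption: empty tokens (from leading, trailing or repeated spaces) are skipped.
    static List<String> splitOnSpace(String input) {
        List<String> words = new ArrayList<>();
        int start = 0;
        int end;
        while ((end = input.indexOf(' ', start)) >= 0) {
            if (end > start) {
                words.add(input.substring(start, end));
            }
            start = end + 1; // continue just past the space
        }
        if (start < input.length()) {
            words.add(input.substring(start)); // text after the last space
        }
        return words;
    }
}
```

Calling SpaceSplitter.splitOnSpace(input) returns a List<String>; if a String[] is required, words.toArray(new String[0]) converts it at the cost of one extra copy.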

While the Sun/Oracle engineers did a decent job in general, there is still room for improvement, especially because you can specialize for your concrete problem. Occasionally you hit a case where a huge speedup is achievable if you don't rely on the JIT compiler to do the whole job perfectly out of the box. Such cases are rare, but they exist.

For example, String.split calls Pattern.compile in the general case, so there a precompiled Pattern is a sure win.
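A minimal sketch of the precompiled Pattern idea follows. It assumes the delimiter is a run of whitespace (the regex \s+), which is not what the question uses; the names WhitespaceSplitter and splitWords are invented for the example.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original question or answers.
final class WhitespaceSplitter {

    // Compiled once and reused for every call; Pattern instances are immutable and thread-safe.
    // Assumption: words are separated by one or more whitespace characters.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String[] splitWords(String input) {
        // trim() avoids a leading empty string when the input starts with whitespace.
        return WHITESPACE.split(input.trim());
    }
}
```

Compared with calling input.split("\\s+") each time, this skips recompiling the regex on every call; the splitting work itself is the same.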

There is an optimization for single-character patterns that avoids the regex overhead, so the possible gain here is limited. Still, I'd try Guava's Splitter and a hand-crafted solution if performance is really important.

You will probably find that splitting on a single space is not what you actually want, and then the gain will be bigger.
