
java.lang.StackOverflowError while using a regex to parse big strings

This is my regex:

((?:(?:'[^']*')|[^;])*)[;]

It tokenizes a string on semicolons. For example,

Hello world; I am having a problem; using regex;

The result is three strings:

Hello world
I am having a problem
using regex

But when I use a large input string, I get this error:

Exception in thread "main" java.lang.StackOverflowError
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
at java.util.regex.Pattern$Loop.match(Pattern.java:4295)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)

How is this caused and how can I solve it?

Unfortunately, Java's built-in regex support has problems with regexes containing repeated alternations (that is, (A|B)* ). Such a pattern is compiled into a recursive call, so every repetition adds stack frames, and matching a very large string ends in a StackOverflowError.

A possible solution is to rewrite your regex so that it does not use a repeated alternation. But if your goal is just to tokenize a string on semicolons, you don't really need a complex regex at all; just use String.split() with a simple ";" as the argument. (Note that this also splits on semicolons inside quotes, which your regex does not.)
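As a minimal sketch of the String.split() suggestion (the class and method names here are made up for illustration, and trimming each piece is my own assumption based on the expected output in the question):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class SemicolonSplit {
    // Splits on every ';' and trims whitespace around each piece.
    // Unlike the regex in the question, this does NOT treat a ';'
    // inside single quotes specially.
    static List<String> split(String input) {
        return Arrays.stream(input.split(";"))
                .map(String::trim)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // prints [Hello world, I am having a problem, using regex]
        System.out.println(split("Hello world; I am having a problem; using regex;"));
    }
}
```

String.split() drops trailing empty strings, so the final semicolon does not produce an extra empty token.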

If you really do need to use a regex that overflows the stack, you can increase the stack size by passing something like -Xss40m to the JVM.
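If changing the JVM flags is not an option, a related trick (not the -Xss flag itself, but a different technique) is to run the match on a dedicated thread and request a larger stack through the Thread constructor. The sketch below assumes a helper I have named callWithStack, and the 64 MB figure and the input size are arbitrary. The stack size argument is only a hint that the platform is free to ignore.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.FutureTask;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BigStackRunner {
    // Runs the task on a new thread that requests the given stack size
    // (in bytes) and returns the task's result. An error thrown by the task,
    // including StackOverflowError, surfaces as an ExecutionException.
    static <T> T callWithStack(Callable<T> task, long stackBytes) throws Exception {
        FutureTask<T> future = new FutureTask<>(task);
        Thread worker = new Thread(null, future, "big-stack-worker", stackBytes);
        worker.start();
        return future.get();
    }

    public static void main(String[] args) throws Exception {
        // The regex from the question, unchanged.
        Pattern original = Pattern.compile("((?:(?:'[^']*')|[^;])*)[;]");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 20_000; i++) {
            sb.append('x');
        }
        sb.append(';');
        String input = sb.toString();

        int tokenLength = callWithStack(() -> {
            Matcher m = original.matcher(input);
            return m.find() ? m.group(1).length() : -1;
        }, 64L * 1024 * 1024);
        System.out.println(tokenLength);
    }
}
```

Either way, a bigger stack only raises the limit; a long enough input will still overflow it.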

It might help to add a + after the [^;] , so that you have fewer repetitions.

Isn't there also some construct that says "if the regular expression matched up to this point, don't backtrack"? Maybe that comes in handy, too. (Update: these are called possessive quantifiers.)
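Putting the two ideas together, here is one way the regex could be rewritten. This is a sketch under assumptions, not a drop-in replacement: the class name is made up, I exclude the quote character from the second alternative ([^;'] instead of [^;]) so that the added + cannot swallow an opening quote, and I assume quotes in the input are balanced. As far as I know, java.util.regex evaluates a possessive quantifier on a group with a plain loop rather than recursion, which is what keeps the stack depth from growing with the input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PossessiveTokenizer {
    // A token is a possessively repeated run of either a quoted section
    // or one or more characters that are neither ';' nor a quote,
    // followed by the terminating ';'.
    private static final Pattern TOKEN =
            Pattern.compile("((?:'[^']*+'|[^;']++)*+);");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group(1).trim());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Hello world, 'a;b' c, using regex]
        System.out.println(tokenize("Hello world; 'a;b' c; using regex;"));
    }
}
```

With unbalanced quotes or a missing final semicolon, the results may differ from those of the original pattern, because the possessive version never backtracks into a quoted section.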

A completely different alternative is to write a utility method called splitQuoted(char quote, char separator, CharSequence s) that explicitly iterates over the string and remembers whether it has seen an odd number of quotes. In that method you could also handle the case that the quote character might need to be unescaped when it appears in a quoted string. For example:

'I'm what I am', said the fox; and he disappeared.
'I\'m what I am', said the fox; and he disappeared.
'I''m what I am', said the fox; and he disappeared.
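A rough sketch of such a splitQuoted method follows. The signature is the one suggested above; everything else is my own assumption: quotes are kept in the output, a trailing piece without a final separator is still returned, and only the doubled-quote convention (the third example) works, simply because two adjacent quotes toggle the state twice. Backslash escapes (the second example) are not handled.

```java
import java.util.ArrayList;
import java.util.List;

class QuotedSplitter {
    // Splits s on separator, ignoring separators that appear after an odd
    // number of quote characters (that is, inside a quoted section).
    static List<String> splitQuoted(char quote, char separator, CharSequence s) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false; // true while an odd number of quotes has been seen
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == quote) {
                inQuotes = !inQuotes;
                current.append(c);
            } else if (c == separator && !inQuotes) {
                parts.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            parts.add(current.toString()); // text after the last separator
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitQuoted('\'', ';',
                "'I''m what I am', said the fox; and he disappeared."));
    }
}
```

Since this is a single pass with no recursion, the input length has no effect on stack usage.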
