Cancelling a long running regex match?

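Say I'm running a service where users can submit a regex to search through lots of data. If a user submits a regex that is very slow (i.e. Matcher.find() takes minutes to return), I want a way to cancel that match. The only way I can think of is to have another thread monitor how long the match is taking and use Thread.stop() to cancel it if necessary.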


long REGEX_TIMEOUT = 30000L;
Object lock = new Object();
boolean finished = false;
Thread matcherThread;


// Matcher thread:
try {
    matcherThread = Thread.currentThread();

    // imagine code to start monitor thread is here

    try {
        matched = matcher.find();
    } finally {
        synchronized (lock) {
            finished = true;
            lock.notifyAll(); // wake the monitor so it stops waiting
        }
    }
} catch (ThreadDeath td) {
    // send angry message to client
    // handle error without rethrowing td
}


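// Monitor thread:
synchronized (lock) {
    while (! finished) {
        try {
            lock.wait(REGEX_TIMEOUT);

            if (! finished) {
                matcherThread.stop(); // deprecated; raises ThreadDeath in the matcher thread
            }
        } catch (InterruptedException ex) {
            // ignore, top level method in dedicated thread, etc..
        }
    }
}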

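I've read java.sun.com/j2se/1.4.2/docs/guide/misc/threadPrimitiveDeprecation.html and I think this usage is safe: I control where ThreadDeath is thrown via synchronisation, I handle it, and the only objects that could be damaged are my Pattern and Matcher instances, which will be discarded anyway. I think this breaks the contract of Thread.stop() because I'm not rethrowing the error, but I don't really want the thread to die, I just want to abort the find() method.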

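I've managed to avoid using these deprecated API components so far, but Matcher.find() does not seem to be interruptible and can take a very long time to return. Is there a better way to do this?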

From Heritrix (crawler.archive.org):

/**
 * CharSequence that notices thread interrupts -- as might be necessary
 * to recover from a loose regex on unexpected challenging input.
 *
 * @author gojomo
 */
public class InterruptibleCharSequence implements CharSequence {
    CharSequence inner;
    // public long counter = 0;

    public InterruptibleCharSequence(CharSequence inner) {
        this.inner = inner;
    }

    public char charAt(int index) {
        if (Thread.interrupted()) { // clears flag if set
            throw new RuntimeException(new InterruptedException());
        }
        // counter++;
        return inner.charAt(index);
    }

    public int length() {
        return inner.length();
    }

    public CharSequence subSequence(int start, int end) {
        return new InterruptibleCharSequence(inner.subSequence(start, end));
    }

    public String toString() {
        return inner.toString();
    }
}

Wrap your CharSequence with this one and thread interrupts will work...
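One way to drive such a wrapper is sketched below: run find() on a worker thread and interrupt it when a timeout expires. This is only a sketch under assumptions. The names InterruptibleFindDemo, InterruptibleChars and findWithTimeout are hypothetical, and InterruptibleChars is a minimal stand-in for the Heritrix class above, not the Heritrix code itself.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.regex.Pattern;

final class InterruptibleFindDemo {

    /** Hypothetical minimal stand-in for the interruptible wrapper shown above. */
    static final class InterruptibleChars implements CharSequence {
        private final CharSequence inner;

        InterruptibleChars(CharSequence inner) {
            this.inner = inner;
        }

        @Override public char charAt(int index) {
            if (Thread.interrupted()) { // clears the flag if it was set
                throw new RuntimeException(new InterruptedException());
            }
            return inner.charAt(index);
        }

        @Override public int length() {
            return inner.length();
        }

        @Override public CharSequence subSequence(int start, int end) {
            return new InterruptibleChars(inner.subSequence(start, end));
        }

        @Override public String toString() {
            return inner.toString();
        }
    }

    /** Returns the find() result, or null if the match was cancelled after timeoutMillis. */
    static Boolean findWithTimeout(Pattern pattern, CharSequence input, long timeoutMillis)
            throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Callable<Boolean> task = () -> pattern.matcher(new InterruptibleChars(input)).find();
            Future<Boolean> result = executor.submit(task);
            try {
                return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                result.cancel(true); // interrupts the worker; its next charAt() call throws
                return null;
            } catch (ExecutionException e) {
                throw new IllegalStateException(e.getCause());
            }
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        String input = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx";
        System.out.println(findWithTimeout(Pattern.compile("(x+x+)+y"), input, 200));
        System.out.println(findWithTimeout(Pattern.compile("b+"), "abba", 200));
    }
}
```

Whether a particular pattern actually backtracks long enough to hit the timeout depends on the JDK version, so the first call may print either null or false. A single-thread executor per call keeps the sketch short; a real service would probably share a pool.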


import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegularExpressionUtils {

    // demonstrates behavior for regular expression running into catastrophic backtracking for given input
    public static void main(String[] args) {
        Matcher matcher = createMatcherWithTimeout(
                "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx", "(x+x+)+y", 2000);
        System.out.println(matcher.matches());
    }

    public static Matcher createMatcherWithTimeout(String stringToMatch, String regularExpression, int timeoutMillis) {
        Pattern pattern = Pattern.compile(regularExpression);
        return createMatcherWithTimeout(stringToMatch, pattern, timeoutMillis);
    }

    public static Matcher createMatcherWithTimeout(String stringToMatch, Pattern regularExpressionPattern, int timeoutMillis) {
        CharSequence charSequence = new TimeoutRegexCharSequence(stringToMatch, timeoutMillis, stringToMatch,
                regularExpressionPattern.pattern());
        return regularExpressionPattern.matcher(charSequence);
    }

    private static class TimeoutRegexCharSequence implements CharSequence {

        private final CharSequence inner;

        private final int timeoutMillis;

        // absolute deadline, fixed when the sequence is created
        private final long timeoutTime;

        private final String stringToMatch;

        private final String regularExpression;

        public TimeoutRegexCharSequence(CharSequence inner, int timeoutMillis, String stringToMatch, String regularExpression) {
            this.inner = inner;
            this.timeoutMillis = timeoutMillis;
            this.stringToMatch = stringToMatch;
            this.regularExpression = regularExpression;
            timeoutTime = System.currentTimeMillis() + timeoutMillis;
        }

        public char charAt(int index) {
            // the matcher reads every character through here, so this is the abort point
            if (System.currentTimeMillis() > timeoutTime) {
                throw new RuntimeException("Timeout occurred after " + timeoutMillis + "ms while processing regular expression '"
                                + regularExpression + "' on input '" + stringToMatch + "'!");
            }
            return inner.charAt(index);
        }

        public int length() {
            return inner.length();
        }

        public CharSequence subSequence(int start, int end) {
            return new TimeoutRegexCharSequence(inner.subSequence(start, end), timeoutMillis, stringToMatch, regularExpression);
        }

        public String toString() {
            return inner.toString();
        }
    }
}

Thanks a lot to dawce for pointing me to this solution in answer to an unnecessarily complicated question!

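Maybe what you need is a new library that implements the NFA algorithm.

The NFA algorithm can be hundreds of times faster than the backtracking algorithm used by the Java standard library.

The Java standard library is also sensitive to the input regex, which is what may be causing your problem: some inputs make the CPU run for years.

With the NFA algorithm, a timeout can be expressed as a limit on the number of steps it takes. That is more efficient than the thread solution. Trust me, I used a thread timeout for a related problem and the performance was horrible. I finally fixed it by modifying the main loop of my algorithm implementation: I inserted some checkpoints into the main loop to test the time.

The details can be found here: https://swtch.com/~rsc/regexp/regexp1.html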
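To illustrate what a step limit in the main loop can look like, here is a toy state-set (NFA-style) matcher with a budget. It is a sketch under heavy assumptions: the class StepBudgetMatcher, its exception and its tiny syntax (literals, '.', and '*' after a single character) are all made up for this example, and it is neither the code from the linked article nor the API of any real library.

```java
import java.util.BitSet;

final class StepBudgetMatcher {

    static final class BudgetExceededException extends RuntimeException {
        BudgetExceededException(long budget) {
            super("step budget of " + budget + " exceeded");
        }
    }

    /** Full match for a toy syntax: literals, '.' (any char), and '*' after one character. */
    static boolean matches(String pattern, String input, long maxSteps) {
        // Parse into tokens: sym[i] is the character, star[i] says whether it is repeated.
        int n = 0;
        char[] sym = new char[pattern.length()];
        boolean[] star = new boolean[pattern.length()];
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (c == '*') {
                if (n == 0 || star[n - 1]) {
                    throw new IllegalArgumentException("dangling '*'");
                }
                star[n - 1] = true;
            } else {
                sym[n++] = c;
            }
        }

        long steps = 0;
        BitSet current = new BitSet(n + 1); // state s = "about to match token s"; state n = accept
        current.set(0);
        closure(current, star, n);
        for (int pos = 0; pos < input.length(); pos++) {
            char c = input.charAt(pos);
            BitSet next = new BitSet(n + 1);
            for (int s = current.nextSetBit(0); s >= 0 && s < n; s = current.nextSetBit(s + 1)) {
                if (++steps > maxSteps) { // the checkpoint inside the main loop
                    throw new BudgetExceededException(maxSteps);
                }
                if (sym[s] == '.' || sym[s] == c) {
                    next.set(star[s] ? s : s + 1);
                }
            }
            closure(next, star, n);
            current = next;
            if (current.isEmpty()) {
                return false;
            }
        }
        return current.get(n);
    }

    // A starred token may be skipped, so its successor state is also active.
    private static void closure(BitSet states, boolean[] star, int n) {
        for (int s = states.nextSetBit(0); s >= 0 && s < n; s = states.nextSetBit(s + 1)) {
            if (star[s]) {
                states.set(s + 1);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(matches("a*b", "aaab", 1_000));
        System.out.println(matches("a*a*a*b", "aaaaaaaaaaaaaaaaaaaa", 1_000));
    }
}
```

Because every input character touches each active state at most once, the work is bounded by pattern length times input length, and the step counter gives a deterministic cut-off that does not depend on wall-clock time or a second thread.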


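  • Create a StateFulCharSequence class that manages the state of the pattern matching. Once that state changes, the next call to the charAt method throws an exception.
  • The state change can be scheduled with a ScheduledExecutorService using the required timeout.
  • Here the pattern matching happens in the main thread, and there is no need to check the thread's interrupt state on every call.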

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TimedPatternMatcher {

    public static void main(String[] args) {
        ScheduledExecutorService executorService = Executors.newScheduledThreadPool(1);
        Pattern pattern = Pattern.compile("some regex pattern");
        StateFulCharSequence stateFulCharSequence = new StateFulCharSequence("some character sequence");
        Matcher matcher = pattern.matcher(stateFulCharSequence);
        executorService.schedule(stateFulCharSequence, 10, TimeUnit.MILLISECONDS);
        try {
            boolean isMatched = matcher.find();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // without this the pool thread keeps the JVM alive
            executorService.shutdownNow();
        }
    }

    /*
     * When this runnable is executed, it will set timeOut to true and pattern
     * matching is stopped by throwing exception.
     */
    public static class StateFulCharSequence implements CharSequence, Runnable {
        private CharSequence inner;

        // volatile so the matching thread sees the flag set by the scheduler thread
        private volatile boolean isTimedOut = false;

        public StateFulCharSequence(CharSequence inner) {
            this.inner = inner;
        }

        @Override
        public char charAt(int index) {
            if (isTimedOut) {
                throw new RuntimeException(new TimeoutException("Pattern matching timeout occurs"));
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new StateFulCharSequence(inner.subSequence(start, end));
        }

        @Override
        public String toString() {
            return inner.toString();
        }

        public void setTimedOut() {
            this.isTimedOut = true;
        }

        @Override
        public void run() {
            this.isTimedOut = true;
        }
    }
}

I included a counter so that the timeout is checked only on every n-th read of charAt, which reduces the overhead.


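Someone said charAt may not be called frequently enough. I just added the foo variable to demonstrate how often charAt is called, and it is frequent enough. If you plan to use this in production, remove that counter, because it decreases performance and would eventually overflow its long if it ran in a server for a long time. In this example, charAt is called 30 million times every 0.8 seconds or so (not tested under proper microbenchmarking conditions, it is just a proof of concept). If you want higher precision you can set a lower checkInterval, at the cost of performance (in the long run, System.currentTimeMillis() > timeoutTime is more expensive than the if clause).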

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.goikosoft.test.RegexpTimeoutException;

/**
 * Allows creating timeoutable regular expressions.
 * Limitations: Can only throw RuntimeException. Decreases performance.
 * Posted by Kris in stackoverflow.
 * Modified by dgoiko to execute the timeout check only every n chars.
 * Now timeout < 0 means no timeout.
 *
 * @author Kris https://stackoverflow.com/a/910798/9465588
 */
public class RegularExpressionUtils {

    // demo only: counts charAt calls; remove in production
    public static long foo = 0;

    // demonstrates behavior for regular expression running into catastrophic backtracking for given input
    public static void main(String[] args) {
        long millis = System.currentTimeMillis();
        // This checkInterval produces a < 500 ms delay. Higher checkInterval will produce higher delays on timeout.
        Matcher matcher = createMatcherWithTimeout(
                "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx", "(x+x+)+y", 10000, 30000000);
        try {
            System.out.println(matcher.matches());
        } catch (RuntimeException e) {
            System.out.println("Operation timed out after " + (System.currentTimeMillis() - millis) + " milliseconds");
        }
    }

    public static Matcher createMatcherWithTimeout(String stringToMatch, String regularExpression, long timeoutMillis,
                                                      int checkInterval) {
        Pattern pattern = Pattern.compile(regularExpression);
        return createMatcherWithTimeout(stringToMatch, pattern, timeoutMillis, checkInterval);
    }

    public static Matcher createMatcherWithTimeout(String stringToMatch, Pattern regularExpressionPattern,
                                                    long timeoutMillis, int checkInterval) {
        if (timeoutMillis < 0) {
            // negative timeout means "no timeout": use the plain matcher
            return regularExpressionPattern.matcher(stringToMatch);
        }
        CharSequence charSequence = new TimeoutRegexCharSequence(stringToMatch, timeoutMillis, stringToMatch,
                regularExpressionPattern.pattern(), checkInterval);
        return regularExpressionPattern.matcher(charSequence);
    }

    private static class TimeoutRegexCharSequence implements CharSequence {

        private final CharSequence inner;

        private final long timeoutMillis;

        private final long timeoutTime;

        private final String stringToMatch;

        private final String regularExpression;

        private int checkInterval;

        private int attemps;

        TimeoutRegexCharSequence(CharSequence inner, long timeoutMillis, String stringToMatch,
                                  String regularExpression, int checkInterval) {
            this.inner = inner;
            this.timeoutMillis = timeoutMillis;
            this.stringToMatch = stringToMatch;
            this.regularExpression = regularExpression;
            timeoutTime = System.currentTimeMillis() + timeoutMillis;
            this.checkInterval = checkInterval;
            this.attemps = 0;
        }

        public char charAt(int index) {
            foo++; // demo only: counts charAt calls
            if (this.attemps == this.checkInterval) {
                // read the clock only once every checkInterval calls
                if (System.currentTimeMillis() > timeoutTime) {
                    throw new RegexpTimeoutException(regularExpression, stringToMatch, timeoutMillis);
                }
                this.attemps = 0;
            } else {
                this.attemps++;
            }

            return inner.charAt(index);
        }

        public int length() {
            return inner.length();
        }

        public CharSequence subSequence(int start, int end) {
            return new TimeoutRegexCharSequence(inner.subSequence(start, end), timeoutMillis, stringToMatch,
                                                regularExpression, checkInterval);
        }

        public String toString() {
            return inner.toString();
        }
    }
}

And the custom exception, so you can catch only THAT exception and avoid swallowing other RuntimeExceptions that Pattern / Matcher may throw.

public class RegexpTimeoutException extends RuntimeException {
    private static final long serialVersionUID = 6437153127902393756L;

    private final String regularExpression;

    private final String stringToMatch;

    private final long timeoutMillis;

    public RegexpTimeoutException() {
        super();
        regularExpression = null;
        stringToMatch = null;
        timeoutMillis = 0;
    }

    public RegexpTimeoutException(String message, Throwable cause) {
        super(message, cause);
        regularExpression = null;
        stringToMatch = null;
        timeoutMillis = 0;
    }

    public RegexpTimeoutException(String message) {
        super(message);
        regularExpression = null;
        stringToMatch = null;
        timeoutMillis = 0;
    }

    public RegexpTimeoutException(Throwable cause) {
        super(cause);
        regularExpression = null;
        stringToMatch = null;
        timeoutMillis = 0;
    }

    public RegexpTimeoutException(String regularExpression, String stringToMatch, long timeoutMillis) {
        super("Timeout occurred after " + timeoutMillis + "ms while processing regular expression '"
                + regularExpression + "' on input '" + stringToMatch + "'!");
        this.regularExpression = regularExpression;
        this.stringToMatch = stringToMatch;
        this.timeoutMillis = timeoutMillis;
    }

    public String getRegularExpression() {
        return regularExpression;
    }

    public String getStringToMatch() {
        return stringToMatch;
    }

    public long getTimeoutMillis() {
        return timeoutMillis;
    }
}

Based on Andreas' answer. The main credit should go to him and his source.

(.*a){x} for x > 10

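I may be a bit naive about regex and regex DoS, but I can't help thinking that a little pre-screening for known "evil" patterns would go a long way toward preventing problems at execution time, especially if the regex in question is input provided by an end user. The pattern above is probably not refined enough, since I am far from an expert on regex. It is just food for thought, because everything else I have found out there seems to indicate that it can't be done, and focuses either on putting a timeout on the regex engine or on limiting the number of iterations it is allowed to execute.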
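A naive version of such a pre-screen might look like the sketch below. It is only a heuristic under assumptions: the class name EvilRegexPrescreen, the method looksRisky and the detection pattern are all hypothetical, it only looks for a group that contains '+' or '*' and is itself followed by '+', '*' or '{', and it will produce both false positives (for example escaped parentheses) and false negatives (for example alternations such as (a|aa)+).

```java
import java.util.regex.Pattern;

final class EvilRegexPrescreen {

    // Hypothetical heuristic: a group without nested parentheses that contains
    // '+' or '*' and is directly followed by '+', '*' or '{'.
    private static final Pattern NESTED_QUANTIFIER =
            Pattern.compile("\\([^()]*[+*][^()]*\\)[+*{]");

    /** Returns true if the user-supplied regex contains a quantified group with an inner quantifier. */
    static boolean looksRisky(String userRegex) {
        return NESTED_QUANTIFIER.matcher(userRegex).find();
    }

    public static void main(String[] args) {
        for (String candidate : new String[] {"(x+x+)+y", "(.*a){11}", "ab+c"}) {
            System.out.println(candidate + " -> " + looksRisky(candidate));
        }
    }
}
```

A screen like this can reject the most obvious offenders cheaply, but it is not a substitute for a timeout or step limit, because many slow patterns do not have this shape.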

Another workaround is to limit the matcher's region, then call find(), and repeat until the thread is interrupted or a match is found.
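A minimal sketch of that idea follows, assuming a fixed chunk size and that no match needs to span a chunk boundary. The class ChunkedFind and the method findInChunks are hypothetical names. Note that the interrupt flag is only checked between chunks, so a single find() that backtracks badly inside one chunk is still not bounded.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ChunkedFind {

    /**
     * Returns the start index of the first match found inside any chunk, or -1.
     * Assumption: matches never need to span a chunk boundary.
     */
    static int findInChunks(Pattern pattern, CharSequence input, int chunkSize)
            throws InterruptedException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        Matcher matcher = pattern.matcher(input);
        long length = input.length();
        // long arithmetic avoids int overflow for very large chunk sizes
        for (long start = 0; start < length; start += chunkSize) {
            if (Thread.interrupted()) {
                throw new InterruptedException("regex search cancelled");
            }
            int from = (int) start;
            int to = (int) Math.min(length, start + chunkSize);
            matcher.region(from, to); // restrict the next find() to this chunk
            if (matcher.find()) {
                return matcher.start();
            }
        }
        return -1;
    }

    public static void main(String[] args) throws InterruptedException {
        // "needle" lies entirely inside the second 8-character chunk
        System.out.println(findInChunks(Pattern.compile("needle"), "aaaaaaaaneedleaa", 8)); // prints 8
    }
}
```

Smaller chunks give more frequent cancellation points but raise the chance of missing a match that straddles a boundary; overlapping the regions would reduce that risk at the cost of repeated work.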


