
Cancelling a long running regex match?

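Say I'm running a service where users can submit a regex to search through lots of data. If a user submits a regex that is very slow (i.e. Matcher.find() takes minutes to return), I want a way to cancel that match. The only way I can think of is to have another thread monitor how long the match is taking and use Thread.stop() to cancel it if necessary.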

Member variables:

long REGEX_TIMEOUT = 30000L;
Object lock = new Object();
boolean finished = false;
Thread matcherThread;

Matcher thread:

try {
    matcherThread = Thread.currentThread();

    // imagine code to start monitor thread is here

    try {
        matched = matcher.find();
    } finally {
        synchronized (lock) {
            finished = true;
            lock.notifyAll();
        }
    }
} catch (ThreadDeath td) {
    // send angry message to client
    // handle error without rethrowing td
}

Monitor thread:

synchronized (lock) {
    while (! finished) {
        try {
            lock.wait(REGEX_TIMEOUT);

            if (! finished) {
                matcherThread.stop();
            }
        } catch (InterruptedException ex) {
            // ignore, top level method in dedicated thread, etc..
        }
    }
}

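I've read java.sun.com/j2se/1.4.2/docs/guide/misc/threadPrimitiveDeprecation.html and I think this usage is safe: I control where ThreadDeath is thrown via synchronisation, I handle it, and the only objects that could be damaged are my Pattern and Matcher instances, which will be discarded anyway. I think this goes against how Thread.stop() is meant to work because I'm not rethrowing the error, but I don't really want the thread to die, just to abort the find() method.

I've managed to avoid these deprecated API components so far, but Matcher.find() does not seem to be interruptible and can take a very long time to return. Is there a better way to do this?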

From Heritrix (crawler.archive.org):

/**
 * CharSequence that noticed thread interrupts -- as might be necessary 
 * to recover from a loose regex on unexpected challenging input. 
 * 
 * @author gojomo
 */
public class InterruptibleCharSequence implements CharSequence {
    CharSequence inner;
    // public long counter = 0; 

    public InterruptibleCharSequence(CharSequence inner) {
        super();
        this.inner = inner;
    }

    public char charAt(int index) {
        if (Thread.interrupted()) { // clears flag if set
            throw new RuntimeException(new InterruptedException());
        }
        // counter++;
        return inner.charAt(index);
    }

    public int length() {
        return inner.length();
    }

    public CharSequence subSequence(int start, int end) {
        return new InterruptibleCharSequence(inner.subSequence(start, end));
    }

    @Override
    public String toString() {
        return inner.toString();
    }
}

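Wrap your CharSequence with this one and thread interrupts will work: the regex engine reads its input through charAt(), so an interrupt is noticed on the next character read.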
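A minimal sketch of how the wrapper might be driven, assuming the match runs on a worker thread from an ExecutorService and is interrupted when a deadline passes. The class InterruptibleFind and the method findWithTimeout are hypothetical names, not part of Heritrix, and the wrapper is repeated in condensed form only so that the example compiles on its own.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Heritrix.
class InterruptibleFind {

    // Condensed copy of the wrapper above so that this sketch is self-contained.
    static final class InterruptibleCharSequence implements CharSequence {
        private final CharSequence inner;

        InterruptibleCharSequence(CharSequence inner) {
            this.inner = inner;
        }

        @Override
        public char charAt(int index) {
            if (Thread.interrupted()) { // clears the flag if it was set
                throw new RuntimeException(new InterruptedException());
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new InterruptibleCharSequence(inner.subSequence(start, end));
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    /** Runs find() on a worker thread; throws TimeoutException if it takes too long. */
    static boolean findWithTimeout(Pattern pattern, CharSequence input, long timeoutMillis)
            throws TimeoutException, InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Callable<Boolean> task =
                () -> pattern.matcher(new InterruptibleCharSequence(input)).find();
        Future<Boolean> result = executor.submit(task);
        try {
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Matching failed", e.getCause());
        } finally {
            // Interrupts the worker if it is still matching; its next charAt() call throws.
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        Pattern pattern = Pattern.compile("(x+x+)+y");
        String input = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx";
        try {
            System.out.println("matched: " + findWithTimeout(pattern, input, 2000));
        } catch (TimeoutException e) {
            System.out.println("gave up after 2000 ms");
        }
    }
}
```

Whether the pattern in main actually runs into the timeout depends on the regex engine of the JVM in use, so no particular output is claimed here. Creating an executor per call keeps the sketch short; a long-running service would more likely share one.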

With a slight variation it is possible to avoid using an additional thread for this:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegularExpressionUtils {

    // demonstrates behavior for regular expression running into catastrophic backtracking for given input
    public static void main(String[] args) {
        Matcher matcher = createMatcherWithTimeout(
                "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx", "(x+x+)+y", 2000);
        System.out.println(matcher.matches());
    }

    public static Matcher createMatcherWithTimeout(String stringToMatch, String regularExpression, int timeoutMillis) {
        Pattern pattern = Pattern.compile(regularExpression);
        return createMatcherWithTimeout(stringToMatch, pattern, timeoutMillis);
    }

    public static Matcher createMatcherWithTimeout(String stringToMatch, Pattern regularExpressionPattern, int timeoutMillis) {
        CharSequence charSequence = new TimeoutRegexCharSequence(stringToMatch, timeoutMillis, stringToMatch,
                regularExpressionPattern.pattern());
        return regularExpressionPattern.matcher(charSequence);
    }

    private static class TimeoutRegexCharSequence implements CharSequence {

        private final CharSequence inner;

        private final int timeoutMillis;

        private final long timeoutTime;

        private final String stringToMatch;

        private final String regularExpression;

        public TimeoutRegexCharSequence(CharSequence inner, int timeoutMillis, String stringToMatch, String regularExpression) {
            super();
            this.inner = inner;
            this.timeoutMillis = timeoutMillis;
            this.stringToMatch = stringToMatch;
            this.regularExpression = regularExpression;
            timeoutTime = System.currentTimeMillis() + timeoutMillis;
        }

        public char charAt(int index) {
            if (System.currentTimeMillis() > timeoutTime) {
                throw new RuntimeException("Timeout occurred after " + timeoutMillis + "ms while processing regular expression '"
                                + regularExpression + "' on input '" + stringToMatch + "'!");
            }
            return inner.charAt(index);
        }

        public int length() {
            return inner.length();
        }

        public CharSequence subSequence(int start, int end) {
            return new TimeoutRegexCharSequence(inner.subSequence(start, end), timeoutMillis, stringToMatch, regularExpression);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

}

Thanks a lot to dawce for pointing me to this solution in answer to an unnecessarily complicated question!

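Maybe what you need is a new library that implements the NFA algorithm.

On pathological patterns, the NFA algorithm can be hundreds of times faster than the backtracking algorithm used by the Java standard library.

The Java standard library is also sensitive to the input regexp, which is exactly what causes your problem: some inputs can keep the CPU busy for years.

With the NFA algorithm, a timeout can be expressed in terms of the number of steps it takes, which is more efficient than the thread solution. Trust me, I used a thread timeout for a related problem and it was horrible for performance. I finally fixed the problem by modifying the main loop of my algorithm implementation: I inserted some checkpoints into the main loop to test the time.

Details can be found here: https://swtch.com/~rsc/regexp/regexp1.html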
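The "limit by steps" idea can be approximated without switching engines. The sketch below is not an NFA implementation; it is an assumption-laden analogue that counts character reads made by java.util.regex and aborts once a budget is used up. All names in it (StepBudgetMatch, BudgetedCharSequence, StepBudgetExceededException) are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical names; a step budget bolted onto java.util.regex, not an NFA engine.
class StepBudgetMatch {

    /** Thrown when the matcher reads more characters than the budget allows. */
    static final class StepBudgetExceededException extends RuntimeException {
        StepBudgetExceededException(long maxSteps) {
            super("Regex gave up after " + maxSteps + " character reads");
        }
    }

    /** Counts every charAt() call and fails once the budget is exhausted. */
    static final class BudgetedCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long maxSteps;
        private long steps = 0;

        BudgetedCharSequence(CharSequence inner, long maxSteps) {
            this.inner = inner;
            this.maxSteps = maxSteps;
        }

        @Override
        public char charAt(int index) {
            if (++steps > maxSteps) {
                throw new StepBudgetExceededException(maxSteps);
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            // The sub-sequence only gets whatever budget is left.
            return new BudgetedCharSequence(inner.subSequence(start, end),
                    Math.max(0, maxSteps - steps));
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    /** Runs find() and aborts with an exception after maxSteps character reads. */
    static boolean find(Pattern pattern, CharSequence input, long maxSteps) {
        return pattern.matcher(new BudgetedCharSequence(input, maxSteps)).find();
    }
}
```

Unlike a wall-clock timeout, a step budget behaves the same on a loaded and an idle machine, at the price of having to pick a budget that suits the input size.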

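A long running pattern matching process can be stopped using the method below.

  • Create a StateFulCharSequence class that manages the state of the pattern matching. Once that state has changed, an exception is thrown on the next call to the charAt method.
  • The state change can be scheduled with a ScheduledExecutorService using the required timeout.
  • Here the pattern matching happens in the main thread, and there is no need to check the thread's interrupt state on every call.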

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TimedPatternMatcher {

    public static void main(String[] args) {
        ScheduledExecutorService executorService = Executors.newScheduledThreadPool(1);
        Pattern pattern = Pattern.compile("some regex pattern");
        StateFulCharSequence stateFulCharSequence = new StateFulCharSequence("some character sequence");
        Matcher matcher = pattern.matcher(stateFulCharSequence);
        executorService.schedule(stateFulCharSequence, 10, TimeUnit.MILLISECONDS);
        try {
            boolean isMatched = matcher.find();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Without this the scheduler's non-daemon thread keeps the JVM alive.
            executorService.shutdownNow();
        }
    }

    /*
     * When this runnable is executed, it sets isTimedOut to true and pattern
     * matching is stopped by throwing an exception from charAt.
     */
    public static class StateFulCharSequence implements CharSequence, Runnable {
        private final CharSequence inner;
        // volatile so the write from the scheduler thread is visible to the matching thread
        private volatile boolean isTimedOut = false;

        public StateFulCharSequence(CharSequence inner) {
            super();
            this.inner = inner;
        }

        @Override
        public char charAt(int index) {
            if (isTimedOut) {
                throw new RuntimeException(new TimeoutException("Pattern matching timeout occurs"));
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            // Note: the returned sequence has its own flag and is not affected by the timeout.
            return new StateFulCharSequence(inner.subSequence(start, end));
        }

        @Override
        public String toString() {
            return inner.toString();
        }

        public void setTimedOut() {
            this.isTimedOut = true;
        }

        @Override
        public void run() {
            this.isTimedOut = true;
        }
    }
}

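I've included a counter so that the timeout is checked only on every n-th read of charAt, which reduces the overhead.

Notes:

Some people stated that charAt may not be called frequently enough. I added the foo variable just to demonstrate how often charAt is called, and that it is frequent enough. If you're going to use this in production, remove that counter: it decreases performance and will eventually overflow the long if it runs in a server for a long time. In this example, charAt is called 30 million times every 0.8 seconds or so (not measured under proper microbenchmarking conditions, it is just a proof of concept). If you want higher precision you can set a lower checkInterval, at the cost of performance (in the long run, the System.currentTimeMillis() > timeoutTime check is more expensive than the counter's if clause).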

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.goikosoft.test.RegexpTimeoutException;

/**
 * Allows to create timeoutable regular expressions.
 *
 * Limitations: Can only throw RuntimeException. Decreases performance.
 *
 * Posted by Kris in stackoverflow.
 *
 * Modified by dgoiko to execute the timeout check only every n chars.
 * Now timeout < 0 means no timeout.
 *
 * @author Kris https://stackoverflow.com/a/910798/9465588
 *
 */
public class RegularExpressionUtils {

    public static long foo = 0;

    // demonstrates behavior for regular expression running into catastrophic backtracking for given input
    public static void main(String[] args) {
        long millis = System.currentTimeMillis();
        // This checkInterval produces a < 500 ms delay. Higher checkInterval will produce higher delays on timeout.
        Matcher matcher = createMatcherWithTimeout(
                "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx", "(x+x+)+y", 10000, 30000000);
        try {
            System.out.println(matcher.matches());
        } catch (RuntimeException e) {
            System.out.println("Operation timed out after " + (System.currentTimeMillis() - millis) + " milliseconds");
        }
        System.out.print(foo);
    }

    public static Matcher createMatcherWithTimeout(String stringToMatch, String regularExpression, long timeoutMillis,
                                                      int checkInterval) {
        Pattern pattern = Pattern.compile(regularExpression);
        return createMatcherWithTimeout(stringToMatch, pattern, timeoutMillis, checkInterval);
    }

    public static Matcher createMatcherWithTimeout(String stringToMatch, Pattern regularExpressionPattern,
                                                    long timeoutMillis, int checkInterval) {
        if (timeoutMillis < 0) {
            return regularExpressionPattern.matcher(stringToMatch);
        }
        CharSequence charSequence = new TimeoutRegexCharSequence(stringToMatch, timeoutMillis, stringToMatch,
                regularExpressionPattern.pattern(), checkInterval);
        return regularExpressionPattern.matcher(charSequence);
    }

    private static class TimeoutRegexCharSequence implements CharSequence {

        private final CharSequence inner;

        private final long timeoutMillis;

        private final long timeoutTime;

        private final String stringToMatch;

        private final String regularExpression;

        private int checkInterval;

        private int attemps;

        TimeoutRegexCharSequence(CharSequence inner, long timeoutMillis, String stringToMatch,
                                  String regularExpression, int checkInterval) {
            super();
            this.inner = inner;
            this.timeoutMillis = timeoutMillis;
            this.stringToMatch = stringToMatch;
            this.regularExpression = regularExpression;
            timeoutTime = System.currentTimeMillis() + timeoutMillis;
            this.checkInterval = checkInterval;
            this.attemps = 0;
        }

        public char charAt(int index) {
            if (this.attemps == this.checkInterval) {
                foo++;
                if (System.currentTimeMillis() > timeoutTime) {
                    throw new RegexpTimeoutException(regularExpression, stringToMatch, timeoutMillis);
                }
                this.attemps = 0;
            } else {
                this.attemps++;
            }

            return inner.charAt(index);
        }

        public int length() {
            return inner.length();
        }

        public CharSequence subSequence(int start, int end) {
            return new TimeoutRegexCharSequence(inner.subSequence(start, end), timeoutMillis, stringToMatch,
                                                regularExpression, checkInterval);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

}

And the custom exception, so you can catch only that exception and avoid swallowing other exceptions that Pattern / Matcher may throw.

public class RegexpTimeoutException extends RuntimeException {
    private static final long serialVersionUID = 6437153127902393756L;

    private final String regularExpression;

    private final String stringToMatch;

    private final long timeoutMillis;

    public RegexpTimeoutException() {
        super();
        regularExpression = null;
        stringToMatch = null;
        timeoutMillis = 0;
    }

    public RegexpTimeoutException(String message, Throwable cause) {
        super(message, cause);
        regularExpression = null;
        stringToMatch = null;
        timeoutMillis = 0;
    }

    public RegexpTimeoutException(String message) {
        super(message);
        regularExpression = null;
        stringToMatch = null;
        timeoutMillis = 0;
    }

    public RegexpTimeoutException(Throwable cause) {
        super(cause);
        regularExpression = null;
        stringToMatch = null;
        timeoutMillis = 0;
    }

    public RegexpTimeoutException(String regularExpression, String stringToMatch, long timeoutMillis) {
        super("Timeout occurred after " + timeoutMillis + "ms while processing regular expression '"
                + regularExpression + "' on input '" + stringToMatch + "'!");
        this.regularExpression = regularExpression;
        this.stringToMatch = stringToMatch;
        this.timeoutMillis = timeoutMillis;
    }

    public String getRegularExpression() {
        return regularExpression;
    }

    public String getStringToMatch() {
        return stringToMatch;
    }

    public long getTimeoutMillis() {
        return timeoutMillis;
    }

}

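Based on Andreas' answer. The main credit should go to him and his source.

How about checking the user-submitted regex for "evil" patterns prior to execution, using one or more regex patterns (this could take the form of a method called before the regex is conditionally executed):

This regex: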

\(.+\+\)[\+\*]

will match:

(a+)+
(ab+)+
([a-zA-Z]+)*

This regex:

\((.+)\|(\1\?|\1{2,})\)\+

will match:

(a|aa)+
(a|a?)+

This regex:

\(\.\*.\)\{\d{2,}\}

will match:

(.*a){x} for x >= 10

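I may be a little naive about regex and regex DoS, but I can't help thinking that a little pre-screening for known "evil" patterns would go a long way toward preventing problems at execution time, especially if the regex in question is input provided by an end user. The patterns above are probably not refined enough, since I am far from an expert on regex. This is just food for thought, because everything else I have found out there seems to say it can't be done and focuses either on putting a timeout on the regex engine or on limiting the number of iterations it is allowed to execute.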
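As a sketch, the three screening expressions above could be wrapped in a single check that runs before a user-supplied regex is compiled. The class and method names (EvilRegexScreen, isSuspicious) are hypothetical, and the list is only as complete as the three patterns given in this answer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper wrapping the three screening patterns from this answer.
class EvilRegexScreen {

    private static final List<Pattern> SUSPICIOUS = Arrays.asList(
            // nested quantifiers such as (a+)+ or ([a-zA-Z]+)*
            Pattern.compile("\\(.+\\+\\)[\\+\\*]"),
            // overlapping alternatives such as (a|aa)+ or (a|a?)+
            Pattern.compile("\\((.+)\\|(\\1\\?|\\1{2,})\\)\\+"),
            // (.*a){x} with a repeat count of two or more digits
            Pattern.compile("\\(\\.\\*.\\)\\{\\d{2,}\\}"));

    /** Returns true if the regex source contains one of the known risky shapes. */
    static boolean isSuspicious(String userRegex) {
        for (Pattern screen : SUSPICIOUS) {
            if (screen.matcher(userRegex).find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        for (String regex : new String[] {"(a+)+", "(a|aa)+", "(.*a){12}", "[a-z]+@[a-z]+"}) {
            System.out.println(regex + " -> " + isSuspicious(regex));
        }
    }
}
```

A screen like this only rejects the shapes it knows about, so it would complement a timeout rather than replace one.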

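Another workaround would be to limit the region of the matcher, then call find(), repeating until the thread is interrupted or a match is found.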
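A rough sketch of that workaround, assuming fixed-size chunks and an interrupt check between chunks. ChunkedFind, findInChunks and chunkSize are hypothetical names. Two limitations are accepted here for brevity: a match that spans a chunk boundary is missed, and a single chunk can still backtrack for a long time if the pattern is bad enough.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: searches the input region by region so that an
// interrupt is noticed between regions.
class ChunkedFind {

    /**
     * Returns the start index of the first match found inside a single chunk,
     * or -1 if no chunk contains a match.
     */
    static int findInChunks(Pattern pattern, CharSequence input, int chunkSize)
            throws InterruptedException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        Matcher matcher = pattern.matcher(input);
        int length = input.length();
        int start = 0;
        do {
            if (Thread.interrupted()) { // clears the flag if it was set
                throw new InterruptedException("regex search cancelled");
            }
            int end = (int) Math.min((long) length, (long) start + chunkSize);
            matcher.region(start, end); // restrict the next find() to [start, end)
            if (matcher.find()) {
                return matcher.start();
            }
            start = end;
        } while (start < length);
        return -1;
    }
}
```

By default a Matcher uses anchoring bounds, so ^ and $ also match at region boundaries; matcher.useAnchoringBounds(false) changes that if it matters for the pattern in question.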
