简体   繁体   中英

Cancelling a long running regex match?

Say I'm running a service where users can submit a regex to search through lots of data. If the user submits a regex that is very slow (ie. takes minutes for Matcher.find() to return), I want a way to cancel that match. The only way I can think of doing this is to have another thread monitor how long a match is taking and use Thread.stop() to cancel it if necessary.

Member variables:

long REGEX_TIMEOUT = 30000L;
Object lock = new Object();
boolean finished = false;
Thread matcherThread;

Matcher thread:

try {
    matcherThread = Thread.currentThread();

    // imagine code to start monitor thread is here

    try {
        matched = matcher.find();
    } finally {
        synchronized (lock) {
            finished = true;
            lock.notifyAll();
        }
    }
} catch (ThreadDeath td) {
    // send angry message to client
    // handle error without rethrowing td
}

Monitor thread:

synchronized (lock) {
    while (! finished) {
        try {
            lock.wait(REGEX_TIMEOUT);

            if (! finished) {
                matcherThread.stop();
            }
        } catch (InterruptedException ex) {
            // ignore, top level method in dedicated thread, etc..
        }
    }
}

I've read java.sun.com/j2se/1.4.2/docs/guide/misc/threadPrimitiveDeprecation.html and I think this usage is safe since I'm controlling where ThreadDeath is thrown via synchronisation and handle it and the only damaged objects could be my Pattern and Matcher instances which will be discarded anyway. I think this breaks Thread.stop() because I'm not rethrowing the error, but I don't really want the thread to die, just abort the find() method.

I've managed to avoid using these deprecated API components so far, but Matcher.find() does not seem to be interruptible and can take a very long time to return. Is there any better way to do this?

From Heritrix: ( crawler.archive.org )

/**
 * CharSequence that noticed thread interrupts -- as might be necessary 
 * to recover from a loose regex on unexpected challenging input. 
 * 
 * @author gojomo
 */
public class InterruptibleCharSequence implements CharSequence {
    CharSequence inner;
    // public long counter = 0; 

    public InterruptibleCharSequence(CharSequence inner) {
        super();
        this.inner = inner;
    }

    public char charAt(int index) {
        if (Thread.interrupted()) { // clears flag if set
            throw new RuntimeException(new InterruptedException());
        }
        // counter++;
        return inner.charAt(index);
    }

    public int length() {
        return inner.length();
    }

    public CharSequence subSequence(int start, int end) {
        return new InterruptibleCharSequence(inner.subSequence(start, end));
    }

    @Override
    public String toString() {
        return inner.toString();
    }
}

Wrap your CharSequence with this one and Thread interrupts will work ...

With a little variation it is possible to avoid using additional threads for this:

public class RegularExpressionUtils {

    // demonstrates behavior for regular expression running into catastrophic backtracking for given input
    public static void main(String[] args) {
        Matcher matcher = createMatcherWithTimeout(
                "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx", "(x+x+)+y", 2000);
        System.out.println(matcher.matches());
    }

    public static Matcher createMatcherWithTimeout(String stringToMatch, String regularExpression, int timeoutMillis) {
        Pattern pattern = Pattern.compile(regularExpression);
        return createMatcherWithTimeout(stringToMatch, pattern, timeoutMillis);
    }

    public static Matcher createMatcherWithTimeout(String stringToMatch, Pattern regularExpressionPattern, int timeoutMillis) {
        CharSequence charSequence = new TimeoutRegexCharSequence(stringToMatch, timeoutMillis, stringToMatch,
                regularExpressionPattern.pattern());
        return regularExpressionPattern.matcher(charSequence);
    }

    private static class TimeoutRegexCharSequence implements CharSequence {

        private final CharSequence inner;

        private final int timeoutMillis;

        private final long timeoutTime;

        private final String stringToMatch;

        private final String regularExpression;

        public TimeoutRegexCharSequence(CharSequence inner, int timeoutMillis, String stringToMatch, String regularExpression) {
            super();
            this.inner = inner;
            this.timeoutMillis = timeoutMillis;
            this.stringToMatch = stringToMatch;
            this.regularExpression = regularExpression;
            timeoutTime = System.currentTimeMillis() + timeoutMillis;
        }

        public char charAt(int index) {
            if (System.currentTimeMillis() > timeoutTime) {
                throw new RuntimeException("Timeout occurred after " + timeoutMillis + "ms while processing regular expression '"
                                + regularExpression + "' on input '" + stringToMatch + "'!");
            }
            return inner.charAt(index);
        }

        public int length() {
            return inner.length();
        }

        public CharSequence subSequence(int start, int end) {
            return new TimeoutRegexCharSequence(inner.subSequence(start, end), timeoutMillis, stringToMatch, regularExpression);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

}

Thanks a lot to dawce for pointing me to this solution in answer to an unnecessary complicated question !

Maybe what you need is a new lib which implements the NFA algorithm.

The NFA algorithm is hundreds times faster than the algorithm which is used by Java standard library.

And the Java std lib is sensitive to the input regexp, which may make your problem happen -- some input make the CPU run for years.

And the timeout can be set by the NFA algorithm through the steps it uses. It is effective than the Thread solution. Trust me I use thread timeout to a relative problem, it is horrible for performance. I finally fix the problem by modify the main loop of my algorithm implement. I insert some check point to the main loop to test the time.

The detail can be found here: https://swtch.com/~rsc/regexp/regexp1.html .

A long-running pattern matching process can be stopped using the below method.

  • Create StateFulCharSequence class which manages the state of pattern matching. When that state is changed, an exception is thrown on the next call to charAt method.
  • That state change can be scheduled using ScheduledExecutorService with a required timeout.
  • Here pattern matching is happening in the main thread and there is no need to check the thread interrupt state every time.

     public class TimedPatternMatcher { public static void main(String[] args) { ScheduledExecutorService executorService = Executors.newScheduledThreadPool(1); Pattern pattern = Pattern.compile("some regex pattern"); StateFulCharSequence stateFulCharSequence = new StateFulCharSequence("some character sequence"); Matcher matcher = pattern.matcher(stateFulCharSequence); executorService.schedule(stateFulCharSequence, 10, TimeUnit.MILLISECONDS); try { boolean isMatched = matcher.find(); }catch (Exception e) { e.printStackTrace(); } } /* When this runnable is executed, it will set timeOut to true and pattern matching is stopped by throwing exception. */ public static class StateFulCharSequence implements CharSequence, Runnable{ private CharSequence inner; private boolean isTimedOut = false; public StateFulCharSequence(CharSequence inner) { super(); this.inner = inner; } public char charAt(int index) { if (isTimedOut) { throw new RuntimeException(new TimeoutException("Pattern matching timeout occurs")); } return inner.charAt(index); } @Override public int length() { return inner.length(); } @Override public CharSequence subSequence(int start, int end) { return new StateFulCharSequence(inner.subSequence(start, end)); } @Override public String toString() { return inner.toString(); } public void setTimedOut() { this.isTimedOut = true; } @Override public void run() { this.isTimedOut = true; } }}

I included a counter in order to check every n reads of charAt, in order to reduce the overhead.

Notes:

Some people stated that carAt may not be call frequently enough. I just added the foo variable in order to demostrate how much charAt is called, and that it is frequent enough. If you're going to use this in production, remove that counter, as it will decrease performance and end up overflowing long if ran in a server for long time. In this example, charAt is called 30 million times every 0.8 secs or so ( not tested with proper microbenchmarking conditions, it is just a proof of concept ). You can set a lower checkInterval if you want higher precission, at the cost of performance (System.currentTimeMillis() > timeoutTime is more expensive than the if clause on the long run.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.goikosoft.test.RegexpTimeoutException;

/**
 * Allows to create timeoutable regular expressions.
 *
 * Limitations: Can only throw RuntimeException. Decreases performance.
 *
 * Posted by Kris in stackoverflow.
 *
 * Modified by dgoiko to  ejecute timeout check only every n chars.
 * Now timeout < 0 means no timeout.
 *
 * @author Kris https://stackoverflow.com/a/910798/9465588
 *
 */
public class RegularExpressionUtils {

    public static long foo = 0;

    // demonstrates behavior for regular expression running into catastrophic backtracking for given input
    public static void main(String[] args) {
        long millis = System.currentTimeMillis();
        // This checkInterval produces a < 500 ms delay. Higher checkInterval will produce higher delays on timeout.
        Matcher matcher = createMatcherWithTimeout(
                "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx", "(x+x+)+y", 10000, 30000000);
        try {
            System.out.println(matcher.matches());
        } catch (RuntimeException e) {
            System.out.println("Operation timed out after " + (System.currentTimeMillis() - millis) + " milliseconds");
        }
        System.out.print(foo);
    }

    public static Matcher createMatcherWithTimeout(String stringToMatch, String regularExpression, long timeoutMillis,
                                                      int checkInterval) {
        Pattern pattern = Pattern.compile(regularExpression);
        return createMatcherWithTimeout(stringToMatch, pattern, timeoutMillis, checkInterval);
    }

    public static Matcher createMatcherWithTimeout(String stringToMatch, Pattern regularExpressionPattern,
                                                    long timeoutMillis, int checkInterval) {
        if (timeoutMillis < 0) {
            return regularExpressionPattern.matcher(stringToMatch);
        }
        CharSequence charSequence = new TimeoutRegexCharSequence(stringToMatch, timeoutMillis, stringToMatch,
                regularExpressionPattern.pattern(), checkInterval);
        return regularExpressionPattern.matcher(charSequence);
    }

    private static class TimeoutRegexCharSequence implements CharSequence {

        private final CharSequence inner;

        private final long timeoutMillis;

        private final long timeoutTime;

        private final String stringToMatch;

        private final String regularExpression;

        private int checkInterval;

        private int attemps;

        TimeoutRegexCharSequence(CharSequence inner, long timeoutMillis, String stringToMatch,
                                  String regularExpression, int checkInterval) {
            super();
            this.inner = inner;
            this.timeoutMillis = timeoutMillis;
            this.stringToMatch = stringToMatch;
            this.regularExpression = regularExpression;
            timeoutTime = System.currentTimeMillis() + timeoutMillis;
            this.checkInterval = checkInterval;
            this.attemps = 0;
        }

        public char charAt(int index) {
            if (this.attemps == this.checkInterval) {
                foo++;
                if (System.currentTimeMillis() > timeoutTime) {
                    throw new RegexpTimeoutException(regularExpression, stringToMatch, timeoutMillis);
                }
                this.attemps = 0;
            } else {
                this.attemps++;
            }

            return inner.charAt(index);
        }

        public int length() {
            return inner.length();
        }

        public CharSequence subSequence(int start, int end) {
            return new TimeoutRegexCharSequence(inner.subSequence(start, end), timeoutMillis, stringToMatch,
                                                regularExpression, checkInterval);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

}

And the custom exception, so you can catch only THAT exception to avoid swalowing other RE Pattern / Matcher may throw.

public class RegexpTimeoutException extends RuntimeException {
    private static final long serialVersionUID = 6437153127902393756L;

    private final String regularExpression;

    private final String stringToMatch;

    private final long timeoutMillis;

    public RegexpTimeoutException() {
        super();
        regularExpression = null;
        stringToMatch = null;
        timeoutMillis = 0;
    }

    public RegexpTimeoutException(String message, Throwable cause) {
        super(message, cause);
        regularExpression = null;
        stringToMatch = null;
        timeoutMillis = 0;
    }

    public RegexpTimeoutException(String message) {
        super(message);
        regularExpression = null;
        stringToMatch = null;
        timeoutMillis = 0;
    }

    public RegexpTimeoutException(Throwable cause) {
        super(cause);
        regularExpression = null;
        stringToMatch = null;
        timeoutMillis = 0;
    }

    public RegexpTimeoutException(String regularExpression, String stringToMatch, long timeoutMillis) {
        super("Timeout occurred after " + timeoutMillis + "ms while processing regular expression '"
                + regularExpression + "' on input '" + stringToMatch + "'!");
        this.regularExpression = regularExpression;
        this.stringToMatch = stringToMatch;
        this.timeoutMillis = timeoutMillis;
    }

    public String getRegularExpression() {
        return regularExpression;
    }

    public String getStringToMatch() {
        return stringToMatch;
    }

    public long getTimeoutMillis() {
        return timeoutMillis;
    }

}

Based on Andreas' answer . Main credits should go for him and his source.

What about checking the user-submitted regex for "evil" patterns prior to execution using one or more regex patterns (this could be in to form of a method called prior to conditional execution of the regex):

This regex:

\(.+\+\)[\+\*]

Will match:

(a+)+
(ab+)+
([a-zA-Z]+)*

This Regex:

\((.+)\|(\1\?|\1{2,})\)\+

Will match:

(a|aa)+
(a|a?)+

This Regex:

\(\.\*.\)\{\d{2,}\}

Will match:

(.*a){x} for x \> 10

I may be a bit naive wrt Regex and Regex DoS, but I can't help but think that a little pre-screening for known "evil" patterns would go a long way toward preventing issues at execution time, especially if the regex in question is an input provided by an end user. The patterns above are likely not refined enough, since I am far from an expert on regex. It is just food for thought, since everything else I have found out there seems to indicate it can't be done, and focuses on either putting a time-out on the regex engine, or limiting the number of iterations it is allowed to execute.

另一种解决方法是限制匹配器的区域,然后调用find() ,重复直到线程中断或找到匹配。

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM