
How to exclude occurrence of a substring from a string using regex?

I have a string input in the following two forms.

1.

<!--XYZdfdjf., 15456, hdfv.4002-->
<!DOCTYPE

2.

<!--XYZdfdjf., 15456, hdfv.4002
<!DOCTYPE

I want to return a match if form 2 is encountered and no match for form 1. Basically, I want a regex that accepts any characters between <!-- and <!DOCTYPE , except when there is an occurrence of --> in between.

I am using Pattern, Matcher and Java regex. I am specifically looking for a regex that can be used with Pattern.compile().

Thanks in advance.

Pattern p = Pattern.compile("(?s)<!--(?:(?!-->).)*<!DOCTYPE");

(?:(?!-->).)* matches one character at a time, after checking that it's not the first character of --> .

(?s) sets DOTALL mode (aka single-line mode), allowing the . to match newline characters.
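Here is a minimal sketch of how that pattern behaves on the two forms from the question. The class name CommentGapDemo and the helper found() are made up for illustration; only the regex itself comes from the answer above.

```java
import java.util.regex.Pattern;

class CommentGapDemo {
    // The regex from the answer: (?s) turns on DOTALL, and (?:(?!-->).)* consumes
    // one character at a time as long as "-->" does not start at that position.
    static final Pattern P = Pattern.compile("(?s)<!--(?:(?!-->).)*<!DOCTYPE");

    // Hypothetical helper: true if the pattern is found anywhere in the input.
    static boolean found(String input) {
        return P.matcher(input).find();
    }

    public static void main(String[] args) {
        String form1 = "<!--XYZdfdjf., 15456, hdfv.4002-->\n<!DOCTYPE";
        String form2 = "<!--XYZdfdjf., 15456, hdfv.4002\n<!DOCTYPE";
        System.out.println(found(form1)); // false: "-->" sits between the markers
        System.out.println(found(form2)); // true: the comment is never closed
    }
}
```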

If there's a possibility of two or more matches and you want to find them individually, you can replace the * with a non-greedy *? , like so:

"(?s)<!--(?:(?!-->).)*?<!DOCTYPE"

For example, applying that regex to the text of your question will find two matches, while the original regex will find one, longer match.
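To see the difference between the two quantifiers, here is a small sketch that counts matches. The input text, the class name GreedyVsLazyDemo and the helper countMatches() are assumptions made up for this example, not something from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazyDemo {
    // Counts how many separate matches the regex finds in the text.
    static int countMatches(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up input: two unclosed comments, each followed by <!DOCTYPE.
        String text = "<!--a\n<!DOCTYPE x\n<!--b\n<!DOCTYPE y";
        String greedy = "(?s)<!--(?:(?!-->).)*<!DOCTYPE";
        String lazy = "(?s)<!--(?:(?!-->).)*?<!DOCTYPE";
        System.out.println(countMatches(greedy, text)); // 1 (runs to the last <!DOCTYPE)
        System.out.println(countMatches(lazy, text));   // 2 (stops at the nearest <!DOCTYPE)
    }
}
```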

This seems like it is easily solved by using String.contains() :

if (yourHtml.contains("-->")) {
    // exclude
} else {
    // extract the content you need
    String content = 
        yourHtml.substring("<!--".length(), yourHtml.indexOf("<!DOCTYPE"));
}

I think you are looking too far into it.
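If the input does not always start with <!-- or may lack <!DOCTYPE , the same idea can be written a bit more defensively. This is only a sketch: the class name ContainsCheckDemo and the method extract() are hypothetical, and it only looks at the first <!-- and the first <!DOCTYPE after it.

```java
class ContainsCheckDemo {
    // Returns the text between the first "<!--" and the next "<!DOCTYPE",
    // or null when a marker is missing or "-->" occurs between them.
    static String extract(String html) {
        int start = html.indexOf("<!--");
        if (start < 0) {
            return null; // no comment opener at all
        }
        int from = start + "<!--".length();
        int end = html.indexOf("<!DOCTYPE", from);
        if (end < 0) {
            return null; // no <!DOCTYPE after the opener
        }
        String content = html.substring(from, end);
        return content.contains("-->") ? null : content;
    }

    public static void main(String[] args) {
        System.out.println(extract("<!--XYZ-->\n<!DOCTYPE") == null); // true: form 1 is rejected
        System.out.println(extract("<!--XYZ\n<!DOCTYPE"));            // XYZ plus the line break
    }
}
```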

\<!--([\s\S](?!--\>))*?(?=\<\!DOCTYPE)

This uses a negative lookahead to reject --> and a positive lookahead to find the <!DOCTYPE without including it in the match. Here's a good reference for lookaround assertions (lookahead and lookbehind).
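As a rough sketch, this is how that regex could be used from Java. The class name LookaheadDemo and the helper firstMatch() are made up for the example, and the backslashes before < , > and ! are dropped because those characters need no escaping.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadDemo {
    // [\s\S] matches any character including line breaks, so no DOTALL flag is needed.
    static final Pattern P = Pattern.compile("<!--([\\s\\S](?!-->))*?(?=<!DOCTYPE)");

    // Hypothetical helper: returns the matched text, or null if nothing matches.
    static String firstMatch(String input) {
        Matcher m = P.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("<!--XYZ-->\n<!DOCTYPE") == null); // true: form 1 is rejected
        // Form 2 matches; the positive lookahead leaves <!DOCTYPE out of the match.
        System.out.println("<!--XYZ\n".equals(firstMatch("<!--XYZ\n<!DOCTYPE"))); // true
    }
}
```

One thing to watch: the negative lookahead here is checked after each character is consumed, so a --> that comes directly after <!-- (as in <!----> ) is not caught and the input still matches.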

I don't have a testing system handy, so I can't give you the exact regex, but you should look in the Pattern documentation for something called a negative lookahead assertion. This allows you to express rules of the form: match this if it is not followed by that.

It should help you:)
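For what it's worth, here is a tiny sketch of the "match this if not followed by that" rule in isolation. The pattern foo(?!bar) , the sample strings, the class name NegativeLookaheadDemo and the helper found() are all made up for illustration.

```java
import java.util.regex.Pattern;

class NegativeLookaheadDemo {
    // Hypothetical helper: true if the regex is found anywhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // (?!bar) succeeds only where "bar" does NOT start at the current position.
        String regex = "foo(?!bar)";
        System.out.println(found(regex, "foobaz")); // true: "foo" is followed by "baz"
        System.out.println(found(regex, "foobar")); // false: "foo" is followed by "bar"
    }
}
```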

A regular expression might not be the best answer to your problem. Have you tried splitting the first line away from everything else and seeing if it contains the --> ?

Specifically, something like:

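String htmlString = "<!--XYZdfdjf., 15456, hdfv.4002\n<!DOCTYPE"; // form 2 from the question
String firstLine = htmlString.split("\r?\n")[0];
if (firstLine.contains("-->")) {
    // no match: the comment is closed on the first line
} else {
    // match
}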

