
RegEx To Ignore Text Between Quotes

I have a regex, which is [\\\\.|\\\\;|\\\\?|\\\\!][\\\\s]
This is used to split a string, but I don't want it to split on . ; ? ! when they appear inside quotes.
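
To make the problem concrete, here is a minimal sketch of what a plain split does to quoted text. It assumes the intended pattern is "one of . ; ? ! followed by a whitespace character", written here as `[.;?!]\\s`; the class name `NaiveSplitDemo`, the method `naiveSplit` and the sample sentence are made up for illustration.

```java
import java.util.Arrays;

class NaiveSplitDemo {

    // Assumed simplification of the pattern in the question:
    // one of . ; ? ! followed by a single whitespace character.
    static String[] naiveSplit(String text) {
        return text.split("[.;?!]\\s");
    }

    public static void main(String[] args) {
        String text = "He said \"Stop. Now!\" and left. Bye";
        // The first split point falls inside the quoted part.
        System.out.println(Arrays.toString(naiveSplit(text)));
    }
}
```

With this input the quoted part is cut after `Stop`, because split has no notion of being inside quotes.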

I wouldn't use split here; I'd use Pattern and Matcher instead.

A demo:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {

    public static void main(String[] args) {

        String text = "start. \"in quotes!\"; foo? \"more \\\" words\"; bar";

        String simpleToken = "[^.;?!\\s\"]+";

        // \\r and \\n are passed as regex escapes: under (?x), literal line breaks
        // in the pattern are ignored as whitespace, even inside a character class.
        String quotedToken =
                "(?x)               # enable inline comments and ignore white spaces in the regex         \n" +
                "\"                 # match a double quote                                                \n" +
                "(                  # open group 1                                                        \n" +
                "  \\\\.            #   match a backslash followed by any char (other than line breaks)   \n" +
                "  |                #   OR                                                                \n" +
                "  [^\\\\\\r\\n\"]  #   any character other than a backslash, line breaks or double quote \n" +
                ")                  # close group 1                                                       \n" +
                "*                  # repeat group 1 zero or more times                                   \n" +
                "\"                 # match a double quote                                                \n";

        String regex = quotedToken + "|" + simpleToken;

        Matcher m = Pattern.compile(regex).matcher(text);

        while(m.find()) {
            System.out.println("> " + m.group());
        }
    }
}

which produces:

> start
> "in quotes!"
> foo
> "more \" words"
> bar

As you can see, it can also handle escaped quotes inside quoted tokens.
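
If you need split-like pieces rather than individual tokens, the same Pattern and Matcher idea can be adapted. The following is only a sketch under a few assumptions: a delimiter is one of . ; ? ! followed by a single whitespace character, quoted sections use double quotes with backslash escapes, and the names `QuoteAwareSplit` and `splitOutsideQuotes` are hypothetical. The regex matches either a whole quoted section (which is skipped) or a delimiter (which ends the current piece).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteAwareSplit {

    // Alternative 1: a double-quoted section, allowing backslash escapes.
    // Alternative 2: a delimiter (captured in group 1) followed by one whitespace char.
    private static final Pattern TOKEN =
            Pattern.compile("\"(?:\\\\.|[^\\\\\"])*\"|([.;?!])\\s");

    static List<String> splitOutsideQuotes(String text) {
        List<String> parts = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        int start = 0;
        while (m.find()) {
            if (m.group(1) != null) {
                // A delimiter outside quotes: close the current piece.
                parts.add(text.substring(start, m.start()));
                start = m.end();
            }
            // Otherwise a quoted section was matched; it stays in the current piece.
        }
        if (start < text.length()) {
            parts.add(text.substring(start)); // trailing text after the last delimiter
        }
        return parts;
    }

    public static void main(String[] args) {
        String text = "start. \"in quotes! really\"; foo? \"more \\\" words. here\"; bar";
        for (String part : splitOutsideQuotes(text)) {
            System.out.println("> " + part);
        }
    }
}
```

As with split, the delimiters themselves are dropped. An unbalanced quote is not treated as the start of a quoted section here, so delimiters after it are split on as usual.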

Here is what I do to skip over quoted text in matches.

(?:[^\"\']|(?:\".*?\")|(?:\'.*?\'))*?    # <-- append the query you wanted to search for - don't use something greedy like .* in the rest of your regex.

To adapt this for your regex, you could do:

(?:[^\"\']|(?:\".*?\")|(?:\'.*?\'))*?[.;?!]\s*
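
One way this pattern might be used from Java is with Matcher.find(), where each match is a piece of text ending at a delimiter outside quotes. This is a rough sketch, not a definitive implementation: the class name `LazyQuoteSkip` and the method `segments` are hypothetical, and the handling of trailing text is an assumption, since the pattern only matches pieces that end with a delimiter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyQuoteSkip {

    // The pattern from the answer, written as a Java string literal.
    private static final Pattern SEGMENT =
            Pattern.compile("(?:[^\"']|\".*?\"|'.*?')*?[.;?!]\\s*");

    static List<String> segments(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = SEGMENT.matcher(text);
        int end = 0;
        while (m.find()) {
            out.add(m.group().trim()); // the piece keeps its closing delimiter
            end = m.end();
        }
        if (end < text.length()) {
            out.add(text.substring(end).trim()); // text after the last delimiter
        }
        return out;
    }

    public static void main(String[] args) {
        for (String s : segments("start. \"in quotes!\"; foo? bar")) {
            System.out.println("> " + s);
        }
    }
}
```

Two caveats, as far as I can tell. The pattern treats every ' as a quote character, so apostrophes in ordinary words can confuse it. And because the quoted alternatives rely on lazy `.*?`, backtracking can let a match end inside a quoted section when no delimiter follows the last closing quote, for example with the input `"a" b "c. d"`.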
