Java regex pattern for a multi-line string

I'm working with a simple Java regular expression program to check whether a string matches a defined regular expression pattern. I have created a regex pattern, but it returns false when I run it. I need to modify the pattern so that it matches the given string. Below is my source code:

        String thread = "From: Demo Name\n" +
                "Sent: Wednesday, January 18, 2023 2:56 PM\n" +
                "To: demo@myweb.com <demo@myweb.com>\n" +
                "Subject: Demo Issue";
        String regEX ="((^[a-zA-Z]+[:]\\s.*\\n*?\\n){2,4}.+\\nSubject[:].+\\n)+?";

        Pattern pattern = Pattern.compile("((^[a-zA-Z]+[:]\\s.*\\n*?\\n){2,4}.+\\nSubject[:].+?\\n)+?",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
        Matcher matcher = pattern.matcher(thread);
        System.out.println(matcher.find());

When I run the program it returns false, but it is expected to return true. In the given string, the words From:, Sent:, To:, and Subject: are constants and won't change. How should I modify the regex pattern to match?

^ matches the start of the entire input unless you enable MULTILINE mode, in which case it matches at the beginning of any line. So you do want MULTILINE mode, so that your ^[a-zA-Z]+:\\s.* pattern matches headers but not random uses of a colon in the middle of the actual text. Note that if someone puts 'Foo: bar' on its own line in the body, you're going to match that too; there's not much you can do about that with regexes alone.
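
To see the effect of MULTILINE on ^ in isolation, here is a minimal sketch. The class name CaretDemo, the helper countMatches, and the shortened three-line input are made up for illustration; only the header pattern comes from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaretDemo {
    // Counts how many separate matches the regex finds in the input.
    static int countMatches(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String input = "From: Demo Name\nTo: demo@myweb.com\nSubject: Demo Issue";
        String header = "^[a-zA-Z]+:\\s.*";

        // Without MULTILINE, ^ only matches at the very start of the input.
        System.out.println(countMatches(header, 0, input));                 // 1
        // With MULTILINE, ^ also matches right after each line terminator.
        System.out.println(countMatches(header, Pattern.MULTILINE, input)); // 3
    }
}
```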

You then attempt to match a thing that is supposed to consume one header line, 2 to 4 times. Then you have a seemingly arbitrarily injected .+ which will mess you up, as it demands an extra line of its own before Subject. Get rid of that. You also have \\n all over the place, far too often. It feels like you just shoved stuff in there, praying that if only you add enough, maybe it'll work.

That's not how you make regexes. When they don't work, make them smaller, not larger: try to match JUST the first line, then expand from there. Keep going from 'matching' to 'matching' instead of starting at something that doesn't match when you feel it should and futilely shoving more stuff into the regex.

The final catch here is that your input string does not end in a newline, and yet your regex demands that the Subject: line end with a newline. It doesn't, so that fails. Using ^ and $ does work, because in MULTILINE mode $ matches at end-of-input as well as at end-of-line.
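
The difference between demanding a trailing \n and using $ can be checked directly. This is a small sketch; the class name EndOfInputDemo, the helper finds, and the two-line input are illustrative assumptions.

```java
import java.util.regex.Pattern;

class EndOfInputDemo {
    // True if the regex (compiled with MULTILINE) matches somewhere in the input.
    static boolean finds(String regex, String input) {
        return Pattern.compile(regex, Pattern.MULTILINE).matcher(input).find();
    }

    public static void main(String[] args) {
        String noTrailingNewline = "To: demo@myweb.com\nSubject: Demo Issue";

        // Requires a newline after the Subject line: fails on this input.
        System.out.println(finds("^Subject:.+\\n", noTrailingNewline));        // false
        // $ matches at end of line AND at end of input in MULTILINE mode.
        System.out.println(finds("^Subject:.+$", noTrailingNewline));          // true
        // The \n version only works once the input really ends in a newline.
        System.out.println(finds("^Subject:.+\\n", noTrailingNewline + "\n")); // true
    }
}
```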

Using that strategy, I fixed your regular expression for you:

String regEX = "(^[a-zA-Z]+:\\s.*\\n){2,4}^Subject:.+$";
// use flags CASE_INSENSITIVE and MULTILINE but not DOTALL.
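
One way to check a pattern built with that strategy against the question's input is sketched below. The pattern here is one possible simplification (extra .+ removed, a single \n per header line, $ instead of a trailing newline) and is an assumption, not necessarily the exact regex the answer intended; HeaderBlockCheck and hasHeaderBlock are made-up names.

```java
import java.util.regex.Pattern;

class HeaderBlockCheck {
    // Assumed simplification: 2 to 4 "Name: value" lines, then a Subject line.
    static final Pattern HEADERS = Pattern.compile(
            "(^[a-zA-Z]+:\\s.*\\n){2,4}^Subject:.+$",
            Pattern.CASE_INSENSITIVE | Pattern.MULTILINE); // deliberately no DOTALL

    static boolean hasHeaderBlock(String text) {
        return HEADERS.matcher(text).find();
    }

    public static void main(String[] args) {
        String thread = "From: Demo Name\n"
                + "Sent: Wednesday, January 18, 2023 2:56 PM\n"
                + "To: demo@myweb.com <demo@myweb.com>\n"
                + "Subject: Demo Issue";
        System.out.println(hasHeaderBlock(thread));          // true
        // A lone Subject line has no preceding header lines, so it does not match.
        System.out.println(hasHeaderBlock("Subject: only")); // false
    }
}
```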

Don't use DOTALL: it makes . match line terminators as well, so .* just eats everything (including the newline, which you don't want).
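
The effect of DOTALL on .* is easy to observe by printing what a single match captures. A minimal sketch, with the class name DotallDemo, the helper firstMatch, and the two-line input chosen for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotallDemo {
    // Returns the text of the first match, or null if there is none.
    static String firstMatch(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "From: Demo Name\nSubject: Demo Issue";

        // Without DOTALL, .* stops at the end of the first line.
        System.out.println(firstMatch("^From:.*", Pattern.MULTILINE, input));
        // With DOTALL, .* runs across the newline to the end of the input.
        System.out.println(firstMatch("^From:.*", Pattern.MULTILINE | Pattern.DOTALL, input));
    }
}
```

The first call yields only the From line; the second yields the whole input, which is why a per-line header pattern stops behaving per-line once DOTALL is on.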

HOWEVER

This regex seems ill-advised, though. What are you actually trying to accomplish? If the input is 'constant', why not ditch regexes and just search for "\nSubject: " instead? If you're trying to get rid of all headers, why not search for the blank line (two consecutive newlines) that separates headers from the body and discard everything before it?

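int headerSplit = in.indexOf("\n\n");
// No blank line means the input is all headers, so treat the body as empty.
String bodyOnly = headerSplit == -1 ? "" : in.substring(headerSplit + 2);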

If you want a combination of these things, then write that. "Put it all in one gigantic regex" is rarely the way to get easy-to-maintain code. If this is a full news/mail message, first find the blank line so you can separate headers from content (after all, Foo: bar is perfectly legal to write in the body of an email message; that doesn't mean it has a Foo header). Then, if you want to pick up the subject specifically, either write a regex or note that you don't really need one:

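// Returns the value of the Subject header, or null if the headers contain none.
String getSubjectFromEmail(String in) {
  int headerEnd = in.indexOf("\n\n");
  int subject = in.indexOf("Subject: ");
  // No "Subject: " anywhere in the input.
  if (subject == -1) return null;
  // "Subject: " only appears in the body, after the blank line.
  if (headerEnd != -1 && subject > headerEnd) return null;
  int subjectEnd = in.indexOf('\n', subject);
  return in.substring(subject + "Subject: ".length(), subjectEnd == -1 ? in.length() : subjectEnd);
}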

That does it without regular expressions. Regexes aren't 'good' at finding that 'end of headers' bit. A hybrid approach, if you prefer that:

import java.util.regex.Pattern;

class Test {
  private static final Pattern SUBJECT_FINDER = Pattern.compile("^Subject: (.*)$", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

  String getSubjectFromEmail(String in) {
    int headerEnd = in.indexOf("\n\n");
    var m = SUBJECT_FINDER.matcher(in);
    if (headerEnd != -1) m.region(0, headerEnd);
    if (!m.find()) return null;
    return m.group(1);
  }

  static final String TEST_TEXT = """
    From: Demo Name
    Sent: Wednesday, January 18, 2023 2:56 PM
    To: demo@myweb.com <demo@myweb.com>
    Subject: Demo Issue"""
    .replace("\r", "");

  void test() {
    String subject = getSubjectFromEmail(TEST_TEXT);
    System.out.println("Subject found: " + subject);
  }

  public static void main(String[] args) {
    new Test().test();
  }
}

Your pattern matches a newline at the end, but there is no newline at the end of the example data.

If the constants never change in the string, you can use \h to match a horizontal whitespace character and \R to match any Unicode newline sequence:

^From:\h+.+\RSent:\h+.+\RTo:\h+.+\RSubject:\h+.*

In Java, with Pattern.MULTILINE and Pattern.CASE_INSENSITIVE and doubled backslashes:

String regEX = "^From:\\h+.+\\RSent:\\h+.+\\RTo:\\h+.+\\RSubject:\\h+.*";

Regex101 demo | Java demo
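
A practical reason to prefer \R over a literal \n is that the same pattern keeps working when the text arrives with Windows line endings. The sketch below compares the two; the class name FixedHeadersDemo, the helper matches, and the \n-only variant are illustrative assumptions.

```java
import java.util.regex.Pattern;

class FixedHeadersDemo {
    static final int FLAGS = Pattern.MULTILINE | Pattern.CASE_INSENSITIVE;

    // Pattern from the answer above: \h = horizontal whitespace, \R = any linebreak.
    static final String WITH_R =
            "^From:\\h+.+\\RSent:\\h+.+\\RTo:\\h+.+\\RSubject:\\h+.*";
    // Hypothetical variant that only accepts a bare \n between lines.
    static final String WITH_N =
            "^From:\\h+.+\\nSent:\\h+.+\\nTo:\\h+.+\\nSubject:\\h+.*";

    static boolean matches(String regex, String input) {
        return Pattern.compile(regex, FLAGS).matcher(input).find();
    }

    public static void main(String[] args) {
        String unix = "From: Demo Name\n"
                + "Sent: Wednesday, January 18, 2023 2:56 PM\n"
                + "To: demo@myweb.com <demo@myweb.com>\n"
                + "Subject: Demo Issue";
        String windows = unix.replace("\n", "\r\n");

        System.out.println(matches(WITH_R, unix));    // true
        System.out.println(matches(WITH_R, windows)); // true
        System.out.println(matches(WITH_N, windows)); // false: \r sits before \n
    }
}
```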


If you want to match 2-4 lines followed by Subject:

(?:^[a-z]+:\h.*\R){2,4}Subject:.*

In Java, with Pattern.MULTILINE and Pattern.CASE_INSENSITIVE and doubled backslashes:

String regEX = "(?:^[a-z]+:\\h.*\\R){2,4}Subject:.*";

Regex101 demo | Java demo
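
Because this version is not tied to specific header names, it can also pull a header block out of surrounding text. A small sketch, assuming a made-up reply-style input; HeaderBlockExtractor and extract are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeaderBlockExtractor {
    static final Pattern BLOCK = Pattern.compile(
            "(?:^[a-z]+:\\h.*\\R){2,4}Subject:.*",
            Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);

    // Returns the first header block ending in a Subject line, or null if none is found.
    static String extract(String text) {
        Matcher m = BLOCK.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "Hi there,\n\n"
                + "From: Demo Name\n"
                + "Sent: Wednesday, January 18, 2023 2:56 PM\n"
                + "To: demo@myweb.com <demo@myweb.com>\n"
                + "Subject: Demo Issue\n"
                + "body text";
        // Prints the four header lines, From: through Subject:, without the body.
        System.out.println(extract(text));
        // Only one header line precedes Subject here, so {2,4} is not satisfied.
        System.out.println(extract("From: x\nSubject: y")); // null
    }
}
```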


 