
Match regex groups to a list in Java (Hearst Pattern)

I'm trying to match Hearst patterns with Java regex. This is my regex:

<np>(\w+)<\/np> such as (?:(?:, | or | and )?<np>(\w+)<\/np>)*

If I have an annotated sentence like:

I have a <np>car</np> such as <np>BMW</np>, <np>Audi</np> or <np>Mercedes</np> and this can drive fast.

I want to get the groups:

1. car
2. [BMW, Audi, Mercedes]

UPDATE: Here is my current Java code:

Pattern pattern = Pattern.compile("<np>(\\w+)<\\/np> such as (?:(?:, | or | and )?<np>(\\w+)<\\/np>)*");
Matcher matcher = pattern.matcher("I have a <np>car</np> such as <np>BMW</np>, <np>Audi</np> or <np>Mercedes</np> and this can drive fast.");

while (matcher.find()) {
    System.out.println(matcher.group(1));
    System.out.println(matcher.group(2));
}

But the 2nd group only contains Mercedes. How can I get all the matches for the 2nd group (maybe as an array)? Is this possible with Java's Pattern and Matcher? And if yes, what is my mistake?

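A capturing group inside a repeated group keeps only the text of its last iteration, which is why group 2 holds only Mercedes. To collect every item, match them one at a time with find(). If you want to be sure the results are contiguous, you can use the \G anchor (written \\G in a Java string), which forces each match to start where the previous match ended: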

Pattern p = Pattern.compile("<np>(\\w+)</np> such as|\\G(?:,| or| and)? <np>(\\w+)</np>");

Note: the \G anchor matches at the end of the previous match or at the start of the string. To avoid matching at the start of the string, you can add the lookbehind (?<!^) after \G.
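As a rough sketch of how the pattern above might be used to build the desired list, the loop below calls find() repeatedly and checks which alternative matched: group 1 is set when the "such as" part matched, group 2 when a following item matched. The class name HearstPatternDemo and the method extract are hypothetical names chosen for illustration, and the pattern is used exactly as given, without the optional lookbehind.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here for illustration only.
class HearstPatternDemo {

    // Alternative 1 captures the general term before "such as" (group 1).
    // Alternative 2 is anchored by \G and captures one listed item (group 2)
    // per find() call, starting where the previous match ended.
    private static final Pattern HEARST = Pattern.compile(
            "<np>(\\w+)</np> such as|\\G(?:,| or| and)? <np>(\\w+)</np>");

    // Returns a map from each general term to the items listed after it.
    static Map<String, List<String>> extract(String sentence) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        List<String> current = null;
        Matcher m = HEARST.matcher(sentence);
        while (m.find()) {
            if (m.group(1) != null) {
                // A new "X such as" was found: start (or reuse) its list.
                current = result.computeIfAbsent(m.group(1), k -> new ArrayList<>());
            } else if (current != null) {
                // A contiguous item after the previous match.
                current.add(m.group(2));
            }
            // If current is null, \G matched at the start of the input
            // with no preceding "such as", so the item is ignored.
        }
        return result;
    }

    public static void main(String[] args) {
        String sentence = "I have a <np>car</np> such as <np>BMW</np>, "
                + "<np>Audi</np> or <np>Mercedes</np> and this can drive fast.";
        System.out.println(extract(sentence));
    }
}
```

For the sentence from the question, this prints {car=[BMW, Audi, Mercedes]}. The trailing " and this can drive fast." is not picked up because " and" is not followed by another <np> element, so the chain of contiguous matches stops there.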


 