
Using Java regular expression to extract author from XML

I understand that regex is not ideal for this task, but I can't use a parser because I need to preserve the offsets. So I have two questions here: one is about the regex, and the other is how to extract "author". If you recommend a parser, please let me know whether there is one that can preserve the offset. I have XML like this:

<post author="lafeat" datetime="2014-04-03T04:26:00" id="p1">
For legions of young couples, there is no wedding venue more desirable than a barn in the country.
</post>

Here is my code:

String regex = "<post\\s*?author=\"(?!\")*\"?.*?>.*?</post>";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(text);
while (m.find()) {
    System.out.println("start from: " + m.start());
    System.out.println("end to: " + m.end());
    System.out.println("the text is: " + text.substring(m.start(), m.end()));
}

But I don't get anything back from this regex. Any suggestions would be greatly appreciated.

Using a dedicated parser is better than any regex you can come up with.


To answer your question:

A negative lookahead isn't required here. It's being used incorrectly anyway:

  1. Applying a quantifier to a zero-width assertion, as in (?!\")* , accomplishes nothing. The lookahead consumes no characters, so repeating it never moves the match forward.

  2. You're not traversing the string. As your regex is currently written, the lookahead only checks a single position. Lookaround assertions are zero-width: they don't match any characters. To capture all the characters from the first double quote to the next, you have to actually match the text. You can use a dot for this purpose: (?:(?!\").)* . It advances through the string character by character until it reaches a position that is followed by a double quote.
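The difference between a bare lookahead and a lookahead paired with a dot can be seen in a minimal sketch. The class name LookaheadDemo, the helper firstMatch, and the shortened input string are made up for illustration and are not part of the original post.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class LookaheadDemo {
    // Returns the text of the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "author=\"lafeat\" id=\"p1\"";
        // Lookahead alone: it asserts but consumes nothing, so only the prefix is matched.
        System.out.println(firstMatch("author=\"(?!\")", input));
        // Lookahead plus dot: consumes every character up to the next double quote.
        System.out.println(firstMatch("author=\"(?:(?!\").)*", input));
    }
}
```

The first call matches only the prefix up to and including the opening quote, while the second also consumes the attribute value.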

This is how you should write the expression:

<post\\s*?author=\"((?:(?!\").)*).*?>

But it doesn't need to be that complicated. You can just use a negated character class and be done with it:

<post\\s*?author=\"([^\"]+)\".*?>

[^\"]+ is a negated character class that matches one or more characters other than a double quote. The parentheses around it capture the author value.
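As a rough sketch of the negated-character-class version in use, the snippet below compiles that pattern and reads capturing group 1. The class name AuthorCapture and the method firstAuthor are hypothetical names chosen for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class AuthorCapture {
    // Group 1 is everything between the quotes of author="...".
    static final Pattern AUTHOR = Pattern.compile("<post\\s*?author=\"([^\"]+)\".*?>");

    // Returns the author attribute of the first matching <post> tag, or null if none matches.
    static String firstAuthor(String text) {
        Matcher m = AUTHOR.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String tag = "<post author=\"lafeat\" datetime=\"2014-04-03T04:26:00\" id=\"p1\">";
        System.out.println(firstAuthor(tag)); // lafeat
    }
}
```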

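You're not getting anything back because the negative lookahead is used incorrectly and there is no capturing group. Note also that . does not match line terminators by default, so .*?</post> cannot span a post body that contains newlines, whereas the negated class [^><]+ in the pattern below can. If you want to extract the author, use a capturing group.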

String regex = "<post\\s*author=\"([^\"]+)\"[^>]+>[^><]+</post>";

Then print the captured group:

while (m.find()) {
    System.out.println("start from: " + m.start());
    System.out.println("end to: " + m.end());
    System.out.println("the text is: " + m.group(1));
}
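Since the question is about preserving offsets, here is a self-contained sketch that applies the pattern above to the sample post and reports where the author value sits in the input, using m.start(1) and m.end(1) on the capturing group. The class name PostOffsets and the method authorSpan are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class; the names are illustrative only.
class PostOffsets {
    static final Pattern POST =
        Pattern.compile("<post\\s*author=\"([^\"]+)\"[^>]+>[^><]+</post>");

    // Returns {start, end} of the author value within text, or null if no post matches.
    static int[] authorSpan(String text) {
        Matcher m = POST.matcher(text);
        return m.find() ? new int[] { m.start(1), m.end(1) } : null;
    }

    public static void main(String[] args) {
        String text = "<post author=\"lafeat\" datetime=\"2014-04-03T04:26:00\" id=\"p1\">\n"
            + "For legions of young couples, there is no wedding venue more desirable than a barn in the country.\n"
            + "</post>";
        int[] span = authorSpan(text);
        // The end offset is exclusive, as with String.substring.
        System.out.println("author at [" + span[0] + ", " + span[1] + "): "
            + text.substring(span[0], span[1]));
    }
}
```

m.start() and m.end() without an argument give the offsets of the whole <post> element, so both the element and the attribute value can be located in the original text.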
