Extracting the author from XML with a Java regular expression

Question

I understand that regex is not ideal for this task, but I can't use a parser because I need to preserve the OFFSET. So I have two questions here: one is about the regex itself, and the other is how to extract the "author". If you recommend a parser, please let me know whether there is one that can preserve the offset. I have XML like this:

<post author="lafeat" datetime="2014-04-03T04:26:00" id="p1">
For legions of young couples, there is no wedding venue more desirable than a barn in the country.
</post>

My code is here:

String regex = "<post\\s*?author=\"(?!\")*\"?.*?>.*?</post>";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(text);
while (m.find()) {
    System.out.println("start from: " + m.start());
    System.out.println("end to: " + m.end());
    System.out.println("the text is: " + text.substring(m.start(), m.end()));
}

But I don't get anything back from this regex. Any suggestions would be greatly appreciated.

Answer 1

Using a dedicated HTML parser is better than any regex you can come up with.

To answer your question:

A negative lookahead isn't required here. It's being used incorrectly anyway:

You cannot usefully apply quantifiers to zero-width assertions, i.e. you can't do this: (?!\")* . The preceding token, the zero-width negative lookahead, consumes no characters, so there is nothing for the quantifier to repeat.
You're not traversing the string. As your regex is currently written, it only checks a single position. It's important to note that lookaround assertions are zero-width: they don't match any characters. So in order to capture all the characters from the first double-quote to the next, you have to actually match the text. You can use a dot for this purpose: (?:(?!\").)* . It advances through the string character by character until it reaches a position that is followed by a double-quote.

This is how you should write the expression (see demo):

<post\\s*?author=\"((?:(?!\").)*).*?>
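As a rough sketch, the expression above can be tried in Java as follows. The class name TemperedDotDemo and the helper firstAuthor are made up for illustration; only the pattern string comes from the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TemperedDotDemo {
    // Pattern from the answer: each dot is guarded by a negative lookahead,
    // so the group stops just before the next double-quote.
    private static final Pattern POST_TAG =
            Pattern.compile("<post\\s*?author=\"((?:(?!\").)*).*?>");

    /** Hypothetical helper: author of the first <post> tag, or null if none is found. */
    public static String firstAuthor(String text) {
        Matcher m = POST_TAG.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "<post author=\"lafeat\" datetime=\"2014-04-03T04:26:00\" id=\"p1\">";
        System.out.println(firstAuthor(text));
    }
}
```

The lookahead itself consumes nothing; it is the dot inside the non-capturing group that moves the match forward one character at a time.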

But it doesn't need to be that complicated. You can just use a negated character class and be done with it (see demo):

<post\\s*?author=\"([^\"]+)\".*?>

[^\"]+ is a negated character class that matches any character other than a double-quote, one or more times. The surrounding \"( ... )\" matches the two quotes and captures the value between them.
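Since the question is about preserving offsets, here is a minimal sketch of how the capturing group's position could be read back with Matcher.start(int) and Matcher.end(int). The class name AuthorOffsets and the helper authorSpan are assumptions made for this example, not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AuthorOffsets {
    // Negated character class variant from the answer.
    private static final Pattern POST_TAG =
            Pattern.compile("<post\\s*?author=\"([^\"]+)\".*?>");

    /**
     * Hypothetical helper: returns {start, end} of the author value inside
     * the original text (end is exclusive), or null if no tag matches.
     */
    public static int[] authorSpan(String text) {
        Matcher m = POST_TAG.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new int[] { m.start(1), m.end(1) };
    }

    public static void main(String[] args) {
        String text = "<post author=\"lafeat\" datetime=\"2014-04-03T04:26:00\" id=\"p1\">";
        int[] span = authorSpan(text);
        // The offsets index into the untouched input string.
        System.out.println(span[0] + ".." + span[1] + " -> " + text.substring(span[0], span[1]));
    }
}
```

Because the regex runs over the raw text, the offsets refer to positions in the original input, which is what a tree-building parser would typically not give you.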

Answer 2

You're not getting anything back because you're using a negative lookahead incorrectly and have no capturing group. If you want to extract the author, use a capturing group.

String regex = "<post\\s*author=\"([^\"]+)\"[^>]+>[^><]+</post>";

And then return the matched group here:

while (m.find()) {
    System.out.println("start from: " + m.start());
    System.out.println("end to: " + m.end());
    System.out.println("the text is: " + m.group(1));
}
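Put together, the two snippets above might look like the following self-contained sketch. The class name PostAuthors, the helper authors, and the second sample post (author "alice") are invented for illustration; the regex is the one from this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PostAuthors {
    // Regex from the answer: group 1 captures the author attribute value.
    private static final Pattern POST =
            Pattern.compile("<post\\s*author=\"([^\"]+)\"[^>]+>[^><]+</post>");

    /** Hypothetical helper: collects the author of every matching <post> element. */
    public static List<String> authors(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = POST.matcher(text);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String text =
                "<post author=\"lafeat\" datetime=\"2014-04-03T04:26:00\" id=\"p1\">\n"
                + "For legions of young couples, there is no wedding venue more desirable than a barn in the country.\n"
                + "</post>\n"
                // Made-up second post, only to show that the loop finds several matches.
                + "<post author=\"alice\" datetime=\"2014-04-03T05:00:00\" id=\"p2\">\n"
                + "A second post.\n"
                + "</post>";
        Matcher m = POST.matcher(text);
        while (m.find()) {
            System.out.println("start from: " + m.start());
            System.out.println("end to: " + m.end());
            System.out.println("the author is: " + m.group(1));
        }
    }
}
```

Note that [^><]+ also matches line breaks, whereas . does not unless Pattern.DOTALL is set, so this pattern can span the multi-line body of the post.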

Answer 1
2 2014-08-05 14:14:48

Answer 2
2 Accepted 2014-08-05 14:15:22
