
How to use regular expressions to parse HTML in Java?

Can someone tell me a simple way to find href and src attributes in an HTML file using regular expressions in Java?
And then, how do I get the URL associated with each attribute?

Thanks for any suggestions.

Using regular expressions to pull values from HTML is always a mistake. HTML syntax is a lot more complex than it may first appear, and it's very easy for a page to catch out even a very complex regular expression.

Use an HTML parser instead. See also "What are the pros and cons of the leading Java HTML parsers?"
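To make the parser route concrete without pulling in a third-party library, here is a minimal sketch that uses the HTML parser bundled with the JDK (javax.swing.text.html.parser.ParserDelegator). The class name LinkCollector and the method extract are made up for this illustration. The bundled parser is old and targets a dated HTML dialect, so treat this as a demonstration of the approach rather than a recommendation over the dedicated parsers mentioned in this thread.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects href and src attribute values in document order.
final class LinkCollector extends HTMLEditorKit.ParserCallback {
    private final List<String> urls = new ArrayList<>();

    // Called for tags that have content, such as <a>.
    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        collect(attrs);
    }

    // Called for tags without content, such as <img>.
    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        collect(attrs);
    }

    private void collect(MutableAttributeSet attrs) {
        for (HTML.Attribute name : new HTML.Attribute[] {HTML.Attribute.HREF, HTML.Attribute.SRC}) {
            Object value = attrs.getAttribute(name);
            if (value != null) {
                urls.add(value.toString());
            }
        }
    }

    static List<String> extract(String html) {
        LinkCollector collector = new LinkCollector();
        try {
            // The last argument tells the parser to ignore charset declarations in the markup.
            new ParserDelegator().parse(new StringReader(html), collector, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return collector.urls;
    }

    public static void main(String[] args) {
        String html = "foo <a href='link1'>bar</a> baz <img src=\"pic.png\"> qux";
        for (String url : extract(html)) {
            System.out.println(url);
        }
    }
}
```

The callback never inspects quoting or attribute order itself; the parser deals with those details, which is the main argument for this route.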

The other answers are right. The Java Regex API is not the proper tool for your goal. Use the efficient, secure and well-tested high-level tools mentioned in the other answers.

If your question is about the Regex API rather than a real-life problem (for learning purposes, for example), you can do it with the following code:

// Requires java.util.regex.Pattern and java.util.regex.Matcher.
String html = "foo <a href='link1'>bar</a> baz <a href='link2'>qux</a> foo";
Pattern p = Pattern.compile("<a href='(.*?)'>");
Matcher m = p.matcher(html);
while (m.find()) {
    System.out.println(m.group(0)); // the whole match
    System.out.println(m.group(1)); // the captured href value
}

And the output is:

<a href='link1'>
link1
<a href='link2'>
link2

Please note that the lazy/reluctant quantifier *? must be used in order to limit each match to a single tag. Group 0 is the entire match; group 1 is the first capturing group (the first pair of parentheses).
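To see why the reluctant quantifier matters, the following sketch runs the same input through a reluctant and a greedy version of the pattern. The class name GreedyVsReluctant and the helper captures are invented for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyVsReluctant {
    // Returns the text captured by group 1 for every match of regex in input.
    static List<String> captures(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "foo <a href='link1'>bar</a> baz <a href='link2'>qux</a> foo";

        // Reluctant: stops at the first '> after each opening tag.
        System.out.println(captures("<a href='(.*?)'>", html));
        // prints [link1, link2]

        // Greedy: runs to the last '> in the input, swallowing both tags.
        System.out.println(captures("<a href='(.*)'>", html));
        // prints [link1'>bar</a> baz <a href='link2]
    }
}
```

The greedy version finds a single match that spans from the first opening tag to the last closing quote, which is rarely what you want when extracting attribute values.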

Don't use regular expressions. Use NekoHTML or TagSoup, which act as bridges that provide SAX or DOM access to HTML documents, just as you would access XML.

If you want to go down the HTML parsing route, which Dave and I recommend, here's the code to parse a String named data for anchor tags and print their href.

Since you're only using anchor tags you should be OK with just a regex, but if you want to do more, go with a parser. The Mozilla HTML Parser is the best out there.

// Locate the native parser library and the Mozilla distribution directory.
File parserLibraryFile = new File("lib/MozillaHtmlParser/native/bin/MozillaParser" + EnviromentController.getSharedLibraryExtension());
String parserLibrary = parserLibraryFile.getAbsolutePath();
// mozilla.dist.bin directory:
final File mozillaDistBinDirectory = new File("lib/MozillaHtmlParser/mozilla.dist.bin." + EnviromentController.getOperatingSystemName());

MozillaParser.init(parserLibrary, mozillaDistBinDirectory.getAbsolutePath());
MozillaParser parser = new MozillaParser();
Document domDocument = parser.parse(data);
NodeList list = domDocument.getElementsByTagName("a");

// Print the href of every anchor that has one.
for (int i = 0; i < list.getLength(); i++) {
    Node n = list.item(i);
    NamedNodeMap m = n.getAttributes();
    if (m != null) {
        Node attrNode = m.getNamedItem("href");
        if (attrNode != null) {
            System.out.println(attrNode.getNodeValue());
        }
    }
}

Contrary to popular opinion, regular expressions are useful tools to extract data from unstructured text (which HTML is).

If you are doing complex HTML data extraction (say, finding all paragraphs in a page) then HTML parsing is probably the way to go. But if you just need to get some URLs from HREFs, then a regular expression will work fine and will be very hard to break.

Try something like this:

/<a[^>]+href=["']?([^'"> ]+)["']?[^>]*>/i
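The pattern above is written as a /.../i literal, which Java does not have. A rough Java equivalent, assuming the trailing i maps to Pattern.CASE_INSENSITIVE, could look like the sketch below. The class name AnchorHrefRegex and the method hrefs are hypothetical names used only here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AnchorHrefRegex {
    // Same pattern as above; double quotes are escaped for the Java string literal.
    private static final Pattern HREF = Pattern.compile(
            "<a[^>]+href=[\"']?([^'\"> ]+)[\"']?[^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Returns the value captured by group 1 for every anchor tag the pattern matches.
    static List<String> hrefs(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<A HREF=\"http://example.com/a\">x</A> <a class=nav href=/b>y</a>";
        System.out.println(hrefs(html));
        // prints [http://example.com/a, /b]
    }
}
```

One limitation worth knowing: the capture group stops at the first space, so a quoted value such as href="my file.html" yields only my.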

Regular expressions can only parse regular languages; that's why they are called regular expressions. HTML is not a regular language, ergo it cannot be parsed by regular expressions.

HTML parsers, on the other hand, can parse HTML; that's why they are called HTML parsers.

You should use your favorite HTML parser instead.
