
parse meta tags in Java

I have a collection of HTML documents for which I need to parse the contents of the <meta> tags in the <head> section. These are the only HTML tags whose values I'm interested in, i.e. I don't need to parse anything in the <body> section.

I've attempted to parse these values using the XPath support provided by JDom. However, this isn't working out too well because a lot of the HTML in the <body> section is not valid XML.

Does anyone have any suggestions for how I might go about parsing these tag values in a manner that can deal with malformed HTML?

Cheers, Don

You can likely use the Jericho HTML Parser. In particular, have a look at this to see how you can go about finding specific tags.
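To make the "lenient parser" idea concrete without pulling in a dependency, here is a minimal sketch that uses the HTML parser bundled with the JDK (javax.swing.text.html.parser) in place of Jericho. This is a swapped-in technique, not Jericho's API. The class name MetaTagScanner and the method metaTags are hypothetical, and the sketch assumes you only care about <meta> tags that carry both a name and a content attribute.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper (name is illustrative); uses the JDK's tolerant HTML parser, not Jericho.
class MetaTagScanner {

    /** Collects name -> content for every <meta> tag that has both attributes. */
    static Map<String, String> metaTags(String html) {
        Map<String, String> result = new LinkedHashMap<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // <meta> is an empty element, so it is reported as a "simple" tag.
                if (tag != HTML.Tag.META) {
                    return;
                }
                Object name = attrs.getAttribute(HTML.Attribute.NAME);
                Object content = attrs.getAttribute(HTML.Attribute.CONTENT);
                if (name != null && content != null) {
                    result.put(name.toString(), content.toString());
                }
            }
        };
        try {
            // Third argument true: ignore charset declarations in <meta> tags
            // instead of aborting with a ChangedCharSetException.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        // The <body> is deliberately malformed; only the <meta> tags matter here.
        String html = "<html><head><meta name=\"keywords\" content=\"java, html\"></head>"
                + "<body><p>unclosed <b>bold <i>oops</b></body></html>";
        metaTags(html).forEach((name, content) -> System.out.println(name + " = " + content));
    }
}
```

Because the parser never requires the input to be well-formed XML, broken markup in the <body> should not stop the <meta> tags from being reported.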

If it suits your application, you could use Tidy to convert the HTML to valid XML, and then use as much XPath as you like!

JTidy should provide a good starting point for this.
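For the XPath half of the Tidy suggestion, a rough sketch with the JDK's built-in javax.xml.xpath might look like the following. It assumes the cleanup step has already produced well-formed XML with no namespace and no DOCTYPE; the sample string in main is a hand-written stand-in, not actual JTidy output. The class name MetaXPathQuery and the method metaTags are hypothetical.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper (name is illustrative); expects XML that has already been tidied.
class MetaXPathQuery {

    /** Returns name -> content for every /html/head/meta element with a name attribute. */
    static Map<String, String> metaTags(String wellFormedXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedXml)));
            // Assumption: element names are lower-case and not namespaced.
            NodeList metas = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("/html/head/meta[@name]", doc, XPathConstants.NODESET);
            Map<String, String> result = new LinkedHashMap<>();
            for (int i = 0; i < metas.getLength(); i++) {
                Element meta = (Element) metas.item(i);
                result.put(meta.getAttribute("name"), meta.getAttribute("content"));
            }
            return result;
        } catch (Exception e) {
            // Parsing or XPath evaluation failed, e.g. the input was not well-formed.
            throw new IllegalArgumentException("Could not parse or query the document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<html><head><title>t</title>"
                + "<meta name=\"description\" content=\"A test page\"/>"
                + "</head><body><p>text</p></body></html>";
        metaTags(xml).forEach((name, content) -> System.out.println(name + " = " + content));
    }
}
```

If the tidied output declares the XHTML namespace, the XPath expression would need a NamespaceContext, which this sketch leaves out.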
