
Java library for HTML analysis

(I've seen similar questions, but I don't think any of them cater to my specific needs, hence...)

I would like to know if there is a Java library for analysis of real-world (read: incomplete, ill-formed) HTML. By analysis, I mean things like:

  • figuring out the most prominent color in an HTML chunk
  • changing that color to some other color (hence, it has to support modification of the HTML as well)
  • pruning out unwanted tags
  • fixing up the HTML so that the result is a well-formed HTML snippet

Parts of the last two are done by libraries such as Jericho and JTidy. 'Plugins' on top of these would be great.

Thanks in advance!

You might want to check out TagSoup:

http://home.ccil.org/~cowan/XML/tagsoup/

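Well, I would first tidy it into valid XML, then use XSLT to do a conditional deep copy. Within that copy I would handle the most prominent color, the pruning, or whatever other processing is needed.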
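A minimal sketch of that idea, using only the XSLT 1.0 processor bundled with the JDK. It assumes the HTML has already been tidied into well-formed XML by some other tool. The class name `XsltPruneSketch`, the choice of `blink` as the unwanted tag, and the `oldColor`/`newColor` parameters are made up for illustration. Only `color` attributes are handled, and the match is an exact string comparison, so colors inside `style` attributes or written in a different case would need extra work.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class XsltPruneSketch {

    // Identity transform (the "deep copy") plus two overriding templates.
    // The tag name "blink" and both parameters are illustrative assumptions.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:param name='oldColor'/>"
      + "<xsl:param name='newColor'/>"
      // Default rule: copy every node and attribute unchanged.
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      // Prune the unwanted element but keep its children.
      + "<xsl:template match='blink'><xsl:apply-templates select='node()'/></xsl:template>"
      // Rewrite color attributes whose value equals oldColor.
      + "<xsl:template match='@color'>"
      + "<xsl:attribute name='color'>"
      + "<xsl:choose>"
      + "<xsl:when test='. = $oldColor'><xsl:value-of select='$newColor'/></xsl:when>"
      + "<xsl:otherwise><xsl:value-of select='.'/></xsl:otherwise>"
      + "</xsl:choose>"
      + "</xsl:attribute>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to an already well-formed XML string.
    static String transform(String xml, String oldColor, String newColor) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            t.setParameter("oldColor", oldColor);
            t.setParameter("newColor", newColor);
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<p><font color=\"#ff0000\">Hi <blink>there</blink></font></p>";
        System.out.println(transform(xml, "#ff0000", "#0000ff"));
    }
}
```

Working out which color is the most prominent is awkward in XSLT 1.0, so that step is probably easier to do in Java first and then pass in as the `oldColor` parameter.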

Take a look at JTidy, a Java port of HTML Tidy. Depending on the options you choose, it will fix non-well-formed HTML and otherwise clean it up.

You'll need something else for the colour-changing part.
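For the colour part, one crude option is plain string processing on the cleaned-up markup. The sketch below is not part of JTidy or any other library mentioned here. It assumes "most prominent" simply means "most frequently occurring", and it only recognises six-digit hex colours such as `#ff0000`, so named colours, `rgb(...)` values and three-digit hex are ignored. The class and method names are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ColorSketch {

    // Six-digit hex colours only; the trailing \b rejects longer hex runs.
    private static final Pattern HEX = Pattern.compile("#[0-9a-fA-F]{6}\\b");

    // Returns the most frequent hex colour (lower-cased), or empty if none is found.
    // Ties go to the colour that appears first in the input.
    static Optional<String> mostFrequentColor(String html) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = HEX.matcher(html);
        while (m.find()) {
            counts.merge(m.group().toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return Optional.ofNullable(best);
    }

    // Replaces every occurrence of oldColor, ignoring case, with newColor.
    static String replaceColor(String html, String oldColor, String newColor) {
        return Pattern.compile(Pattern.quote(oldColor), Pattern.CASE_INSENSITIVE)
                .matcher(html)
                .replaceAll(Matcher.quoteReplacement(newColor));
    }

    public static void main(String[] args) {
        String html = "<p style=\"color:#FF0000\">a<font color=\"#ff0000\">b</font>"
                + "<span style=\"color:#00ff00\">c</span></p>";
        String top = mostFrequentColor(html).orElse("(none)");
        System.out.println("Most frequent colour: " + top);
        System.out.println(replaceColor(html, top, "#0000ff"));
    }
}
```

Counting occurrences ignores how much text each colour actually applies to, so a weighting by text length may be closer to what "prominent" means in practice.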

Maybe you'll find something in this list (try TagSoup, NekoHTML, VietSpider HTMLParser).


 