Parsing HTML with cyberneko to find a 'div' tag
I need one specific 'div' tag (identified by its 'id') from an HTML page. To parse the page I'm using cyberneko.
def doc = new XmlParser( new org.cyberneko.html.parsers.SAXParser() ).parse(htmlFile)
divTag = doc.depthFirst().DIV.find{ it['@id'] == tagId }
So far no problem, but in the end I don't need XML; I need the original content of the whole 'div' tag. Unfortunately, I can't figure out how to do this...
EDIT: Responding to the first comment.
This works:
def html = """
<body>
<div id="breadcrumbs">
<b>
crumb1
</b>
</div>
</body>
"""
def doc = new XmlSlurper(new org.cyberneko.html.parsers.SAXParser()).parseText(html)
divTag = doc.BODY.DIV.find { it.@id == 'breadcrumbs' }
println "" << new groovy.xml.StreamingMarkupBuilder().bind {xml -> xml.mkp.yield divTag}
It looks like cyberneko will return a well-formed HTML document regardless of whether the original markup was well formed, i.e., doc's root will be an HTML element, and there will also be a HEAD element. Neat.
This is a simple test based on noah's answer. Unfortunately, it does not (yet) work :(
def html = """
<body>
<div id="breadcrumbs">
<b>
crumb1
</b>
</div>
</body>
"""
def doc = new XmlSlurper( new org.cyberneko.html.parsers.SAXParser() ).parseText(html)
println "document: $doc"
def htmlTag = doc.DIV.find {
println "-> $it"
it['@id'] == "breadcrumbs"
}
println htmlTag
assert htmlTag
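A likely reason the test fails, going by the observation above that doc's root is the HTML element: doc.DIV only looks at the root's direct children (HEAD and BODY), so it never reaches the DIV inside BODY. The sketch below searches the whole tree instead. It assumes NekoHTML is on the classpath and a Groovy version where XmlSlurper is imported by default (on Groovy 4 an explicit import of groovy.xml.XmlSlurper would be needed); the variable name markup is just for illustration.

```groovy
// Assumption: NekoHTML is on the classpath, e.g. via
// @Grab('net.sourceforge.nekohtml:nekohtml:1.9.22')
import groovy.xml.StreamingMarkupBuilder

def html = '<body><div id="breadcrumbs"><b>crumb1</b></div></body>'

def doc = new XmlSlurper(new org.cyberneko.html.parsers.SAXParser()).parseText(html)

// doc is the HTML root that cyberneko adds, so search every descendant
// rather than only the root's direct children.
def divTag = doc.depthFirst().find { it.name() == 'DIV' && it.@id == 'breadcrumbs' }

// Serialize the matched node back to markup, as in noah's answer.
def markup = new StreamingMarkupBuilder().bind { mkp.yield divTag }.toString()
println markup
```

Note that the result is re-serialized from the parsed tree, so it may differ from the original source text in details such as tag case, quoting and whitespace.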