
Java DOM XML parser: get HTML tags (<p color="something">some text</p>) from XML

I have an XML file with HTML tags like:

<?xml version="1.0" encoding="utf-8" ?>
 <blog>
 <blogid>49</blogid>
 <title>[FIXED] Job requests page broken</title> 
 <fulltext>
 <img title="page broken" src="images/west/blog/site-broken.jpg" alt="page broken" />
 <p><span style="background-color: #ccffcc;">Update 28/05/2011</span>: Job requests page seems to be working OK now. If you find any issues please use the contact page to notify us. Thank you for your patience!</p>
 <p>&#160;</p>
 <p>Well, what can I say? Why does it always have to be that way? You are trying to create something new and something else gets broken on the way...</p>
 </fulltext>

Now I want the whole HTML part between the <fulltext> tags exactly as it is. What I get right now is blank, I think because the DOM parser is parsing the HTML tags as well.

I tried XPath, but it is not working on Android.

Use a library like Jsoup for this purpose.

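import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class JsoupXmlExample {
    public static void main(String[] args) {
        // Inner double quotes must be escaped inside a Java string literal.
        String html = "<?xml version=\"1.0\"?><foo>"
                    + "<bar>Some&nbsp;text &mdash; invalid!</bar></foo>";
        // Parse with Jsoup's lenient XML parser; the empty string is the base URI.
        Document doc = Jsoup.parse(html, "", Parser.xmlParser());

        // Print every <bar> element, markup included.
        for (Element e : doc.select("bar")) {
            System.out.println(e);
        }
    }
}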

I don't think you can get this XML, which is not well-formed, into a DOM as-is. (EDIT: or is it well-formed?)

You would need to either a) escape the markup characters, which makes the XML well-formed and parseable (but probably not into the DOM you want; I guess you want to display the HTML in a different system), b) parse it using a stream processor, or c) fix it using string manipulation (wrap the content in <![CDATA[ ... ]]>) and then parse it into a DOM.

HTH
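Option c) could look roughly like the sketch below, which uses only the JDK's built-in DOM parser. The class and method names (CdataWrapSketch, wrapInCdata, extractRaw) are made up for illustration, and the string replacement assumes the tag carries no attributes and its content never contains "]]>".

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CdataWrapSketch {

    // Wraps the content of every <tag>...</tag> pair in a CDATA section so the
    // XML parser treats the inner markup as plain character data.
    // Assumption: the tag has no attributes and the content contains no "]]>".
    static String wrapInCdata(String xml, String tag) {
        String open = "<" + tag + ">";
        String close = "</" + tag + ">";
        return xml.replace(open, open + "<![CDATA[").replace(close, "]]>" + close);
    }

    // Parses the patched XML and returns the raw text of the first <tag> element.
    static String extractRaw(String xml, String tag) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(wrapInCdata(xml, tag))));
        return doc.getElementsByTagName(tag).item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // &nbsp; and the unclosed <br> would normally make this XML unparseable.
        String xml = "<blog><blogid>49</blogid>"
                   + "<fulltext><p>Some&nbsp;text</p><br></fulltext></blog>";
        System.out.println(extractRaw(xml, "fulltext"));
    }
}
```

Because the content is hidden inside CDATA, HTML-only entities such as &nbsp; and unclosed tags no longer trip up the XML parser, and the string comes back unchanged.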

The HTML in this sample happens to be well-formed XML (strictly speaking, HTML is not a subset of XML, but XHTML-style markup like this is). Therefore, there is no reason for the DOM parser not to treat those inner tags as XML elements.

Maybe what you're looking for is a way to flatten what's inside <fulltext>?
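One way to flatten it, assuming the content parses as well-formed XML, is to serialize each child node of <fulltext> back to markup with an identity Transformer from the JDK. This is only a sketch: InnerXmlSketch and innerXml are hypothetical names, and the result is re-serialized markup, so details such as attribute quoting or empty-element syntax may differ from the original text.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class InnerXmlSketch {

    // Serializes every child of the given node back to markup and
    // concatenates the results, leaving out the parent's own tags.
    static String innerXml(Node parent) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        // Without this, each serialized child would start with <?xml ...?>.
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            identity.transform(new DOMSource(children.item(i)), new StreamResult(out));
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<blog><fulltext><p><span style=\"color: red;\">Update</span>"
                   + ": OK now.</p></fulltext></blog>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Node fulltext = doc.getElementsByTagName("fulltext").item(0);
        // Prints the markup of the <p> element without the <fulltext> wrapper.
        System.out.println(innerXml(fulltext));
    }
}
```

Unlike the CDATA approach, this only works when the embedded HTML is itself well-formed XML, as it appears to be in the question's sample.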
