How can I read html tags from within an xml file?

I have an XML file that I am reading with Java code. A fragment of what I am reading and the code is below:

 <?xml version="1.0" encoding="UTF-8"?>
 <caml:MeasureDoc version="1.0" xsi:schemaLocation="http://lc.ca.gov/legalservices/schemas/caml.1# xca.1.xsd"
     xmlns:caml="http://lc.ca.gov/legalservices/schemas/caml.1#"
     xmlns:xlink="http://www.w3.org/1999/xlink"
     xmlns:xhtml="http://www.w3.org/1999/xhtml"
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
     <caml:BudgetItem id="id_6D1BA0B6-8097-43E3-8A48-13249E6CAD6B" num="2240-002-0890">
         <caml:Content>
             <table cellspacing="0" class="Abutted" id="id_8C3F2551-7554-4A16-9256-0B408C6CD7BB" width="416">
                 <tbody>
                     <tr style="keep-together.within-page:always;">
                         <td colspan="7" valign="top" width="336">
                             <p class="Stub">
                                 <caml:NumSpan>2240-002-0890</caml:NumSpan>—For state operations, Department of Housing and Community Development, payable from the Federal Trust Fund.
                                 <span class="DottedLeaders"/>
                             </p>
                          </td>
                          <td align="right" valign="bottom" width="80">0</td>
                      </tr>
                      <tr style="keep-with-next.within-page:always;">
                          <td valign="top" width="24"/>
                          <td colspan="7" valign="top" width="392">Schedule:</td>
                      </tr>
                  </tbody>
             </table>
         </caml:Content>
     </caml:BudgetItem>
 </caml:MeasureDoc>

Java code:

 import javax.xml.parsers.DocumentBuilderFactory; // etc, etc.
 ...
 DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
 DocumentBuilder builder = factory.newDocumentBuilder();
 ByteArrayInputStream input = new ByteArrayInputStream(billXml.getBytes("UTF-8"));
 Document doc = builder.parse(input);
 Element root = doc.getDocumentElement();
 Node bill = LU.subNodeWithName(root, "caml:Bill");
 Node budgetInfoNode = LU.findBudgetInfoNode(bill); // (my helper method)
 Node contentNode = budgetInfoNode.getChildNodes().item(0);
 Node tableNode = contentNode.getChildNodes().item(0);
 System.out.println(tableNode.toString());

Output:

 [table: null]

If I call getTextContent() on the table, I get:

 2240-002-0890?For state operations, Department of Housing and Community Development, payable from the
 Federal Trust Fund.0Schedule:(1)1665-Financial Assistance Program0Provisions:1.The funds appropriated
 in this item shall be made available to administer the State Rental Assistance Program.2.Upon order of the
 Department of Finance, amounts transferred to this item may be transferred to Schedule (1) of
 Item 2240-102-0890.3.Any amounts transferred to Schedule (1) of this item pursuant to Provision 2 of
 Item 2240-102-0890 shall be available for encumbrance and expenditure until June 30, 2022.

Neither of these is what I want. I want the html within the XML node.

There seems to be no "getRealContent" method that works like "getTextContent" but keeps the tags. Apologies if I am missing something obvious.

How can I read the table tag and the tags within it?

Bonus if anyone knows the property to set to get this to stop. I am seeing this over and over and over and over again:

JAXP: find factoryId =javax.xml.transform.TransformerFactory
JAXP: found system property, value=org.apache.xalan.processor.TransformerFactoryImpl
JAXP: created new instance of class org.apache.xalan.processor.TransformerFactoryImpl using ClassLoader: null

Unfortunately XMLProperties.ShutTheHeckUpAlready does not exist. More's the pity.
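
On the bonus question: the `JAXP:` lines look like the trace that the JDK's JAXP factory lookup writes to stderr when the `jaxp.debug` system property is set. Assuming that is the cause here (the JVM options of this application are not shown, so this is a guess), removing `-Djaxp.debug=...` from the launch command, or clearing the property before the first JAXP factory is created, should silence it. The class name `QuietJaxpSketch` and the helper `jaxpDebugRequested` are made up for this sketch.

```java
import javax.xml.transform.TransformerFactory;

class QuietJaxpSketch {

    /**
     * Reports whether jaxp.debug is currently set to anything other than "false".
     * Assumption: this mirrors how the JDK decides to enable the trace.
     */
    static boolean jaxpDebugRequested() {
        String value = System.getProperty("jaxp.debug");
        return value != null && !"false".equals(value);
    }

    public static void main(String[] args) throws Exception {
        if (jaxpDebugRequested()) {
            System.err.println("jaxp.debug is set; check the JVM options for -Djaxp.debug");
        }
        // Clear the property before any JAXP factory lookup happens. The flag
        // appears to be read once per lookup class, so clearing it later may
        // have no effect.
        System.clearProperty("jaxp.debug");

        // Any JAXP lookup from here on should no longer print "JAXP: ..." lines.
        TransformerFactory.newInstance().newTransformer();
    }
}
```

If the property is set by a container or a startup script rather than by your own code, removing it there is the more reliable fix.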

This may not be the most intuitive solution, but if you import the required node into a new document and run that document through a Transformer to serialize it to a string, you get the html.

// Parse the XML string into a DOM document.
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
ByteArrayInputStream input = new ByteArrayInputStream(billXml.getBytes("UTF-8"));
Document doc = builder.parse(input);
Element root = doc.getDocumentElement();
// The factory is not namespace-aware by default, so the prefixed name is matched literally.
NodeList budgetItem = root.getElementsByTagName("caml:BudgetItem");
for (int temp = 0; temp < budgetItem.getLength(); temp++) {
    Node node = budgetItem.item(temp);
    if (node.getNodeType() == Node.ELEMENT_NODE) {
        Element eElement = (Element) node;
        // Take the first <table> anywhere below this budget item.
        NodeList table = eElement.getElementsByTagName("table");
        Node item = table.item(0);
        if (item == null) {
            continue; // this budget item has no table
        }

        String content = getHTMLContent(factory, item);
        System.out.println(content);
    }
}

private static String getHTMLContent(DocumentBuilderFactory factory, Node item) throws ParserConfigurationException, TransformerException {
    // Copy the node (deep = true, so children come along) into a fresh document.
    DocumentBuilder builder = factory.newDocumentBuilder();
    Document newDocument = builder.newDocument();
    Node importedNode = newDocument.importNode(item, true);
    newDocument.appendChild(importedNode);

    // A Transformer created without a stylesheet copies its input to its output unchanged.
    Transformer transformer = TransformerFactory.newInstance().newTransformer();
    transformer.setOutputProperty(OutputKeys.METHOD, "html");

    StreamResult result = new StreamResult(new StringWriter());

    DOMSource source = new DOMSource(newDocument);
    transformer.transform(source, result);
    return result.getWriter().toString();
}
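
As a variation on the code above, `DOMSource` also accepts an arbitrary node, so the copy into a new document can be skipped. The following is a minimal self-contained sketch, not the answer's exact method: the class name `NodeMarkupSketch`, the helpers `toMarkup` and `demo`, and the shortened sample XML are all made up for illustration, and it uses the default XML output method with the XML declaration suppressed rather than the `html` method.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeMarkupSketch {

    /** Serializes a node and all of its descendants back to markup, tags included. */
    static String toMarkup(Node node) throws TransformerException {
        // No stylesheet: the transformer just copies the source to the result.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // Leave out the leading <?xml ...?> line, since this is only a fragment.
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(node), new StreamResult(out));
        return out.toString();
    }

    /** Parses a shortened stand-in for the question's XML and returns the table markup. */
    static String demo() {
        String xml =
            "<caml:Content xmlns:caml=\"http://lc.ca.gov/legalservices/schemas/caml.1#\">"
            + "<table width=\"416\"><tbody><tr><td align=\"right\">0</td></tr></tbody></table>"
            + "</caml:Content>";
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // First <table> element in the document.
            Node table = doc.getElementsByTagName("table").item(0);
            return toMarkup(table);
        } catch (Exception e) {
            throw new IllegalStateException("sample document could not be processed", e);
        }
    }

    public static void main(String[] args) {
        // Prints the table element with its tags and attributes.
        System.out.println(demo());
    }
}
```

In the question's code, the same helper could be called as `toMarkup(tableNode)` in place of `tableNode.toString()`.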
