How can I read html tags from within an xml file?

I have an XML file that I am reading with Java code. The snippet I am reading and the code are as follows:

 <?xml version="1.0" encoding="UTF-8"?>
 <caml:MeasureDoc version="1.0" xsi:schemaLocation="http://lc.ca.gov/legalservices/schemas/caml.1# xca.1.xsd"
     xmlns:caml="http://lc.ca.gov/legalservices/schemas/caml.1#"
     xmlns:xlink="http://www.w3.org/1999/xlink"
     xmlns:xhtml="http://www.w3.org/1999/xhtml"
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
     <caml:BudgetItem id="id_6D1BA0B6-8097-43E3-8A48-13249E6CAD6B" num="2240-002-0890">
         <caml:Content>
             <table cellspacing="0" class="Abutted" id="id_8C3F2551-7554-4A16-9256-0B408C6CD7BB" width="416">
                 <tbody>
                     <tr style="keep-together.within-page:always;">
                         <td colspan="7" valign="top" width="336">
                             <p class="Stub">
                                 <caml:NumSpan>2240-002-0890</caml:NumSpan>—For state operations, Department of Housing and Community Development, payable from the Federal Trust Fund.
                                 <span class="DottedLeaders"/>
                             </p>
                          </td>
                          <td align="right" valign="bottom" width="80">0</td>
                      </tr>
                      <tr style="keep-with-next.within-page:always;">
                          <td valign="top" width="24"/>
                          <td colspan="7" valign="top" width="392">Schedule:</td>
                      </tr>
                  </tbody>
             </table>
         </caml:Content>
     </caml:BudgetItem>
 </caml:MeasureDoc>

Code:

 import javax.xml.parsers.DocumentBuilderFactory; // etc, etc.
 ...
 DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
 DocumentBuilder builder = factory.newDocumentBuilder();
 ByteArrayInputStream input = new ByteArrayInputStream(billXml.getBytes("UTF-8"));
 Document doc = builder.parse(input);
 Element root = doc.getDocumentElement();
 Node bill = LU.subNodeWithName(root, "caml:Bill");
 Node budgetInfoNode = LU.findBudgetInfoNode(bill); // (my helper method)
 Node contentNode = budgetInfoNode.getChildNodes().item(0);
 Node tableNode = contentNode.getChildNodes().item(0);
 System.out.println(tableNode.toString());

Output:

 [table: null]

If I call getTextContent() on the table node, I get:

 2240-002-0890?For state operations, Department of Housing and Community Development, payable from the
 Federal Trust Fund.0Schedule:(1)1665-Financial Assistance Program0Provisions:1.The funds appropriated
 in this item shall be made available to administer the State Rental Assistance Program.2.Upon order of the
 Department of Finance, amounts transferred to this item may be transferred to Schedule (1) of
 Item 2240-102-0890.3.Any amounts transferred to Schedule (1) of this item pursuant to Provision 2 of
 Item 2240-102-0890 shall be available for encumbrance and expenditure until June 30, 2022.

Neither of these is what I want. I want the html inside the XML node.

There doesn't seem to be a "getRealContent" method that works like "getTextContent" but keeps the tags. Sorry if I'm missing something obvious.

How can I read the table tag and the tags inside it?

Bonus points if anyone knows a property to set to make this stop. I see this over and over:

JAXP: find factoryId =javax.xml.transform.TransformerFactory
JAXP: found system property, value=org.apache.xalan.processor.TransformerFactoryImpl
JAXP: created new instance of class org.apache.xalan.processor.TransformerFactoryImpl using ClassLoader: null

Unfortunately, XMLProperties.ShutTheHeckUpAlready does not exist. More's the pity.
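
On the bonus question: those "JAXP:" lines look like the trace that the JAXP factory lookup prints when the `jaxp.debug` system property is set. That is an assumption about this setup, since the launch configuration is not shown. A minimal sketch to check whether the property is set in the running JVM (the class name `JaxpDebugCheck` and the helper `tracingEnabled` are made up for illustration):

```java
public class JaxpDebugCheck {

    // Assumed rule: tracing is on when the property is set to anything other than "false".
    static boolean tracingEnabled(String propertyValue) {
        return propertyValue != null && !"false".equals(propertyValue);
    }

    public static void main(String[] args) {
        // Reads the system property; prints null if it was never set.
        String value = System.getProperty("jaxp.debug");
        System.out.println("jaxp.debug = " + value
                + ", tracing enabled: " + tracingEnabled(value));
    }
}
```

If this reports the property as set, the likely places to look are a `-Djaxp.debug=...` JVM argument or a `System.setProperty` call in startup code. The flag is typically read once, when the factory lookup classes load, so removing it at the source is probably more reliable than clearing it later at runtime.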

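This may not be the most intuitive solution, but if we copy the desired node into a new document and apply a transformation that serializes it to a string, we can get the html.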

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
ByteArrayInputStream input = new ByteArrayInputStream(billXml.getBytes("UTF-8"));
Document doc = builder.parse(input);
Element root = doc.getDocumentElement();
NodeList budgetItem = root.getElementsByTagName("caml:BudgetItem");
for (int temp = 0; temp < budgetItem.getLength(); temp++) {
    Node node = budgetItem.item(temp);
    if (node.getNodeType() == Node.ELEMENT_NODE) {
        Element eElement = (Element) node;
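        // All <table> elements nested anywhere under this budget item
        NodeList table = eElement.getElementsByTagName("table");
        Node item = table.item(0);
        if (item == null) {
            continue; // this budget item has no table, nothing to serialize
        }

        String content = getHTMLContent(factory, item);
        System.out.println(content);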

    }
}

private static String getHTMLContent(DocumentBuilderFactory factory, Node item) throws ParserConfigurationException, TransformerException {
    DocumentBuilder builder = factory.newDocumentBuilder();
    Document newDocument = builder.newDocument();
    Node importedNode = newDocument.importNode(item, true);
    newDocument.appendChild(importedNode);

    Transformer transformer = TransformerFactory.newInstance().newTransformer();
    transformer.setOutputProperty(OutputKeys.METHOD, "html");

    StreamResult result = new StreamResult(new StringWriter());

    DOMSource source = new DOMSource(newDocument);
    transformer.transform(source, result);
    return result.getWriter().toString();
}    
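
For reference, here is a self-contained sketch of the same idea that can be run on its own. It is a variation, not the code above: it passes the node straight to `DOMSource` instead of importing it into a new document, and it keeps the default XML output method with the XML declaration omitted. The class name `NodeMarkupDemo`, the helper names, and the tiny sample document are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class NodeMarkupDemo {

    // Serializes a DOM node, including its tags and attributes, to a string.
    static String toMarkup(Node node) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // Leave out the <?xml ...?> header so only the node's own markup is returned.
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(node), new StreamResult(out));
        return out.toString();
    }

    // Parses the XML text and returns the markup of the first <table>, or null if there is none.
    static String firstTableMarkup(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Node table = doc.getElementsByTagName("table").item(0);
        return table == null ? null : toMarkup(table);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample input, much smaller than the real bill XML.
        String xml = "<doc><content><table width=\"416\">"
                + "<tr><td>Schedule:</td></tr></table></content></doc>";
        System.out.println(firstTableMarkup(xml));
    }
}
```

Whether to use the "html" or the default XML output method depends on what consumes the string: the XML method should keep empty elements such as `<span class="DottedLeaders"/>` self-closed, while the html method writes them the way an HTML serializer would.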
