简体   繁体   English

将HTML标记解析为XML

[英]Parsing HTML tags into XML

I'm trying to parse XML that's embedded in the HTML file below. 我正在尝试解析下面HTML文件中嵌入的XML。 Here's the detail from one of the tags: 以下是其中一个标签的详细信息:

           DOM<tr class="iris_table_row">
                <td style=" width:37.50%; text-align:left; " class="ta_10"><span class="ta_10">Tangible assets</span></td>
                <td style=" width:2.50%; text-align:right; " class="ta_10"><span class="ta_10">2</span></td>
                <td style=" width:30.00%; text-align:right; " class="ta_61"><ix:nonFraction contextRef="cfwd_31_03_2014" name="ns5:TangibleFixedAssets" unitRef="GBP" decimals="0" format="ixt2:numdotdecimal" scale="0" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">7,956</ix:nonFraction></td>
                <td style=" width:1.25%; " class="ta_61" />
                <td style=" width:26.25%; text-align:right; " class="ta_60"><ix:nonFraction contextRef="cfwd_31_03_2013" name="ns5:TangibleFixedAssets" unitRef="GBP" decimals="0" format="ixt2:numdotdecimal" scale="0" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">5,402</ix:nonFraction></td>
                <td style=" width:1.25%; " class="ta_60" />
                <td style=" width:1.25%; " class="ta_10" />
            </tr>

I've tried using a DOM parser in java to do this but it doesn't recognize the XML tags. 我尝试在Java中使用DOM解析器来执行此操作,但它无法识别XML标签。

The value of db.parse(fXmlFile) in the code below is "null". 下面的代码中db.parse(fXmlFile)的值为“ null”。

File fXmlFile = new File("Prod223_1254_04903825_20140331 copy.xml");

    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    dbf.setValidating(false);
    dbf.setNamespaceAware(true);
    dbf.setIgnoringComments(false);
    dbf.setIgnoringElementContentWhitespace(false);
    dbf.setExpandEntityReferences(false);
    DocumentBuilder db = dbf.newDocumentBuilder();

    System.out.println(db.parse(fXmlFile));

How can I get the all the tags and information into java? 如何将所有标签和信息放入Java? Ideally I'd be able to load them into a bean. 理想情况下,我可以将它们加载到Bean中。

Here is an example of the type of file I'm trying to parse. 这是我要解析的文件类型的示例。

 <?xml version="1.0" encoding="utf-8"?><html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL" xmlns:ixt="http://www.xbrl.org/inlineXBRL/transformation/2010-04-20" xmlns:ixt2="http://www.xbrl.org/inlineXBRL/transformation/2011-07-31" xmlns:link="http://www.xbrl.org/2003/linkbase" xmlns:xbrli="http://www.xbrl.org/2003/instance" xmlns:xbrldi="http://xbrl.org/2006/xbrldi" xmlns:xl="http://www.xbrl.org/2003/XLink" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:fn="http://www.w3.org/2005/xpath-functions" xmlns:iris="http://www.iris.co.uk/ixbrl" xmlns:ns0="http://www.xbrl.org/uk/gaap/core-full/2009-09-01" xmlns:ns5="http://www.xbrl.org/uk/gaap/core/2009-09-01" xmlns:ns6="http://www.xbrl.org/uk/reports/direp/2009-09-01" xmlns:ns7="http://www.xbrl.org/uk/cd/business/2009-09-01" xmlns:ns8="http://www.xbrl.org/uk/all/types/2009-09-01" xmlns:ns9="http://xbrl.org/2005/xbrldt" xmlns:ns10="http://www.xbrl.org/uk/all/common/2009-09-01" xmlns:ns11="http://www.xbrl.org/2006/ref" xmlns:ns12="http://www.xbrl.org/uk/cd/countries/2009-09-01" xmlns:ns13="http://www.xbrl.org/uk/all/ref/2009-09-01" xmlns:ns14="http://www.xbrl.org/uk/cd/currencies/2009-09-01" xmlns:ns15="http://www.xbrl.org/uk/cd/exchanges/2009-09-01" xmlns:ns16="http://www.xbrl.org/uk/cd/languages/2009-09-01" xmlns:ns17="http://www.xbrl.org/2004/ref" xmlns:ns18="http://www.xbrl.org/uk/all/gaap-ref/2009-09-01" xmlns:ns19="http://www.xbrl.org/uk/reports/aurep/2009-09-01" xmlns:iso4217="http://www.xbrl.org/2003/iso4217" xmlns:ns20="http://www.govtalk.gov.uk/uk/fr/tax/full-gaap-dpl/2013-10-01" xmlns:ns21="http://www.govtalk.gov.uk/uk/fr/tax/dpl-gaap-main/2013-10-01" xmlns:ns22="http://www.govtalk.gov.uk/uk/fr/tax/dpl-gaap/2013-10-01" xmlns:ns23="http://www.govtalk.gov.uk/uk/fr/tax/dpl-core/2013-10-01">
<head>
    <meta name="PostingEntryNumber" content="4" />
    <meta name="PeriodRecordNumber" content="2341" />
    <meta content="application/xhtml+xml; charset=UTF-8" http-equiv="Content-Type" />
    <meta name="description" content="iXBRL report production" />
    <meta name="Mode" content="CH" />
    <meta http-equiv="X-UA-Compatible" content="IE=8" />

    <title>Shortt Orthopaedics Limited - Limited company - abbreviated - 11.6</title>
    <style type="text/css">
        @media print
        {
            hr { display:none; }
            .portraitpage
            {
                min-height:273mm;
                max-width:170mm;
            }
            .landscapepage
            {
                min-height:170mm;
                max-width:273mm;
            }
        }
        @media screen
        {
            .portraitpage
            {
                max-width:170mm;
                min-height:273mm;
                margin:12mm 20mm 12mm 20mm;
            }
            .landscapepage
            {
                max-width:273mm;
                min-height:170mm;
                margin:12mm 20mm 12mm 20mm;
            }
        }
        body{ margin:0px; font-size:1.3em; }
        td{ padding:0px; }
        div.portraitpage{ page-break-after:always; position:relative; }
        div.landscapepage{ page-break-after:always; position:relative; }
            div.header{ position:relative; }
            div.footer{ left:0px; right:0px; bottom:0px; text-align:center; position:absolute; }
    div.container{ position:relative; }
                    div.maintext{ width:100.00%; position:relative; }
                    div.tagged_blob{ width:100.00%; position:relative; }
                                table.iris_table{ width:100.00%; border-collapse:collapse; }
                table.iris_table_header{ width:100.00%; border-collapse:collapse; }
                table.iris_table_footer{ width:100.00%; border-collapse:collapse; }
        div.hr.iris_hr{ width:100.00%; }
            td.total_single{ border-top:thin solid black; }
            td.total_double{ border-top:double black; }
        .ta_10{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_11{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_12{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_13{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_20{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_21{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_22{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_23{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_30{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_31{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_32{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_33{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_40{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_41{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_42{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_43{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_50{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_51{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_52{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_53{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_60{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_61{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_62{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_63{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_70{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_71{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_72{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_73{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_80{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_81{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_82{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_83{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_90{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_91{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_92{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_93{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_100{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_101{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_102{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_103{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_110{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_111{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_112{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_113{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_120{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_121{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_122{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_123{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_130{ color:rgb(0, 0, 0); font-family:"Courier New"; font-size:13px; font-weight:400; }
        .ta_131{ color:rgb(0, 0, 0); font-family:"Courier New"; font-size:13px; font-weight:700; }
        .ta_132{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Courier New"; font-size:13px; font-weight:700; }
        .ta_133{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Courier New"; font-size:13px; font-weight:400; }
        .ta_140{ color:rgb(0, 0, 0); font-family:"Arial"; font-size:13px; font-weight:400; }
        .ta_141{ color:rgb(0, 0, 0); font-family:"Arial"; font-size:13px; font-weight:400; }
        .ta_142{ color:rgb(0, 0, 0); font-family:"Arial"; font-size:13px; font-weight:400; }
        .ta_143{ color:rgb(0, 0, 0); font-family:"Arial"; font-size:13px; font-weight:400; }
    </style>
</head>
<body xml:lang="en">
    <div style="display:none">
        <ix:header>
            <ix:hidden>
                <ix:nonNumeric contextRef="FY_31_03_2014" name="ns7:NameAuthor" order="1" tupleRef="XBRLDocumentAuthorGrouping_Group45" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL"></ix:nonNumeric>
                <ix:nonNumeric contextRef="FY_31_03_2014" name="ns7:DescriptionOrTitleAuthor" order="2" tupleRef="XBRLDocumentAuthorGrouping_Group45" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL"></ix:nonNumeric>
                <ix:nonNumeric contextRef="FY_31_03_2014" name="ns7:UKCompaniesHouseRegisteredNumber" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">07189486</ix:nonNumeric>
                <ix:nonNumeric contextRef="CountriesHypercube_FY_31_03_2014_Set1" name="ns7:CountryFormationOrIncorporation" format="ixt2:nocontent" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL" />
                <ix:nonNumeric contextRef="CurrenciesHypercube_FY_31_03_2014_Set2" name="ns7:PrincipalCurrencyUsedInBusinessReport" format="ixt2:nocontent" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL" />
                <ix:nonNumeric contextRef="EntityOfficersHypercube_FY_31_03_2014_Set3" name="ns5:NameDirectorSigningAccounts" format="ixt2:nocontent" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL" />
                <ix:nonNumeric contextRef="cfwd_31_03_2014" name="ns7:StartDateForPeriodCoveredByReport" format="ixt2:datedaymonthyear" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">1.4.13</ix:nonNumeric>
                <ix:nonNumeric contextRef="cfwd_31_03_2014" name="ns7:EndDateForPeriodCoveredByReport" format="ixt2:datedaymonthyear" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">31.3.14</ix:nonNumeric>
                <ix:nonNumeric contextRef="cfwd_31_03_2014" name="ns7:BalanceSheetDate" format="ixt2:datedaymonthyear" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">31.3.14</ix:nonNumeric>
                <ix:nonNumeric contextRef="FY_31_03_2014" name="ns7:EntityAccountsType" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">Company accounts</ix:nonNumeric>
                <ix:nonNumeric contextRef="FY_31_03_2014" name="ns7:LegalFormOfEntity" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">Private Limited Company</ix:nonNumeric>
                <ix:nonNumeric contextRef="FY_31_03_2014" name="ns7:DescriptionPeriodCoveredByReport" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">FY</ix:nonNumeric>
                <ix:nonNumeric contextRef="FY_31_03_2014" name="ns7:EntityTrading" format="ixt2:booleantrue" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">true</ix:nonNumeric>

[stackoverflow limits body text] [stackoverflow限制了正文]

I think you need a two step approach. 我认为您需要采用两步法。

  • Use a HTML parser to get to the embedded XML in question 使用HTML解析器获取有问题的嵌入式XML
  • ...then use the DOM parser on the content ...然后在内容上使用DOM解析器

HTML is not always XML compliant (unless you're using XHTML which has become less fashionable). HTML并不总是与XML兼容的(除非您使用的XHTML已经不那么流行了)。 Browsers let lots of things slip like missing tags, single vs double quotes, attributes without values etc, this is probably why your site fails to parse. 浏览器让很多事情如丢失标签,单引号或双引号,没有值的属性等滑倒,这可能就是您的网站无法解析的原因。

Many are available. 许多可用。

According to the documentation, DTD validation always takes place , even when you tell it not to! 根据文档,即使您告诉DTD验证,它始终会进行验证

What you want to do is to create a new DTD that adds your namespace to the standard XHTML DTD; 您要做的是创建一个新的DTD,将您的命名空间添加到标准XHTML DTD中。 the W3 site discusses how to acheive this , and the example they give is for MathML: W3网站讨论了如何实现这一点 ,并且他们给出的示例适用于MathML:

First, define a content model module that instantiates the MathML DTD and connects it to the content model: 首先,定义一个内容模型模块,该模块实例化MathML DTD并将其连接到内容模型:

<!-- File: mathml-model.mod -->
<!ENTITY % XHTML1-math
     PUBLIC "-//W3C//DTD MathML 2.0//EN"
            "http://www.w3.org/TR/MathML2/dtd/mathml2.dtd" >
%XHTML1-math;

<!ENTITY % Inlspecial.extra 
     "%a.qname; | %img.qname; | %object.qname; | %map.qname; 
      | %Mathml.Math.qname;" >

Next, define a DTD driver that identifies our new content model module as the content model for the DTD, and hands off processing to the XHTML 1.1 driver (for example): 接下来,定义一个DTD驱动程序,将我们的新内容模型模块标识为DTD的内容模型,并将处理交给XHTML 1.1驱动程序(例如):

<!-- File: xhtml-mathml.dtd -->
<!ENTITY % xhtml-model.mod
      SYSTEM "mathml-model.mod" >
<!ENTITY % xhtml11.dtd
     PUBLIC "-//W3C//DTD XHTML 1.1//EN"
            "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd" >
%xhtml11.dtd;

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM