
Parsing HTML tags into XML

I'm trying to parse XML that's embedded in the HTML file below. Here's the detail from one of the tags:

            <tr class="iris_table_row">
                <td style=" width:37.50%; text-align:left; " class="ta_10"><span class="ta_10">Tangible assets</span></td>
                <td style=" width:2.50%; text-align:right; " class="ta_10"><span class="ta_10">2</span></td>
                <td style=" width:30.00%; text-align:right; " class="ta_61"><ix:nonFraction contextRef="cfwd_31_03_2014" name="ns5:TangibleFixedAssets" unitRef="GBP" decimals="0" format="ixt2:numdotdecimal" scale="0" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">7,956</ix:nonFraction></td>
                <td style=" width:1.25%; " class="ta_61" />
                <td style=" width:26.25%; text-align:right; " class="ta_60"><ix:nonFraction contextRef="cfwd_31_03_2013" name="ns5:TangibleFixedAssets" unitRef="GBP" decimals="0" format="ixt2:numdotdecimal" scale="0" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">5,402</ix:nonFraction></td>
                <td style=" width:1.25%; " class="ta_60" />
                <td style=" width:1.25%; " class="ta_10" />
            </tr>

I've tried using a DOM parser in Java to do this, but it doesn't recognize the XML tags.

The value of db.parse(fXmlFile) in the code below is "null".

File fXmlFile = new File("Prod223_1254_04903825_20140331 copy.xml");

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setValidating(false);
dbf.setNamespaceAware(true);
dbf.setIgnoringComments(false);
dbf.setIgnoringElementContentWhitespace(false);
dbf.setExpandEntityReferences(false);
DocumentBuilder db = dbf.newDocumentBuilder();

System.out.println(db.parse(fXmlFile));

How can I get all the tags and their information into Java? Ideally I'd be able to load them into a bean.

Here is an example of the type of file I'm trying to parse.

 <?xml version="1.0" encoding="utf-8"?><html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL" xmlns:ixt="http://www.xbrl.org/inlineXBRL/transformation/2010-04-20" xmlns:ixt2="http://www.xbrl.org/inlineXBRL/transformation/2011-07-31" xmlns:link="http://www.xbrl.org/2003/linkbase" xmlns:xbrli="http://www.xbrl.org/2003/instance" xmlns:xbrldi="http://xbrl.org/2006/xbrldi" xmlns:xl="http://www.xbrl.org/2003/XLink" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:fn="http://www.w3.org/2005/xpath-functions" xmlns:iris="http://www.iris.co.uk/ixbrl" xmlns:ns0="http://www.xbrl.org/uk/gaap/core-full/2009-09-01" xmlns:ns5="http://www.xbrl.org/uk/gaap/core/2009-09-01" xmlns:ns6="http://www.xbrl.org/uk/reports/direp/2009-09-01" xmlns:ns7="http://www.xbrl.org/uk/cd/business/2009-09-01" xmlns:ns8="http://www.xbrl.org/uk/all/types/2009-09-01" xmlns:ns9="http://xbrl.org/2005/xbrldt" xmlns:ns10="http://www.xbrl.org/uk/all/common/2009-09-01" xmlns:ns11="http://www.xbrl.org/2006/ref" xmlns:ns12="http://www.xbrl.org/uk/cd/countries/2009-09-01" xmlns:ns13="http://www.xbrl.org/uk/all/ref/2009-09-01" xmlns:ns14="http://www.xbrl.org/uk/cd/currencies/2009-09-01" xmlns:ns15="http://www.xbrl.org/uk/cd/exchanges/2009-09-01" xmlns:ns16="http://www.xbrl.org/uk/cd/languages/2009-09-01" xmlns:ns17="http://www.xbrl.org/2004/ref" xmlns:ns18="http://www.xbrl.org/uk/all/gaap-ref/2009-09-01" xmlns:ns19="http://www.xbrl.org/uk/reports/aurep/2009-09-01" xmlns:iso4217="http://www.xbrl.org/2003/iso4217" xmlns:ns20="http://www.govtalk.gov.uk/uk/fr/tax/full-gaap-dpl/2013-10-01" xmlns:ns21="http://www.govtalk.gov.uk/uk/fr/tax/dpl-gaap-main/2013-10-01" xmlns:ns22="http://www.govtalk.gov.uk/uk/fr/tax/dpl-gaap/2013-10-01" xmlns:ns23="http://www.govtalk.gov.uk/uk/fr/tax/dpl-core/2013-10-01">
<head>
    <meta name="PostingEntryNumber" content="4" />
    <meta name="PeriodRecordNumber" content="2341" />
    <meta content="application/xhtml+xml; charset=UTF-8" http-equiv="Content-Type" />
    <meta name="description" content="iXBRL report production" />
    <meta name="Mode" content="CH" />
    <meta http-equiv="X-UA-Compatible" content="IE=8" />

    <title>Shortt Orthopaedics Limited - Limited company - abbreviated - 11.6</title>
    <style type="text/css">
        @media print
        {
            hr { display:none; }
            .portraitpage
            {
                min-height:273mm;
                max-width:170mm;
            }
            .landscapepage
            {
                min-height:170mm;
                max-width:273mm;
            }
        }
        @media screen
        {
            .portraitpage
            {
                max-width:170mm;
                min-height:273mm;
                margin:12mm 20mm 12mm 20mm;
            }
            .landscapepage
            {
                max-width:273mm;
                min-height:170mm;
                margin:12mm 20mm 12mm 20mm;
            }
        }
        body{ margin:0px; font-size:1.3em; }
        td{ padding:0px; }
        div.portraitpage{ page-break-after:always; position:relative; }
        div.landscapepage{ page-break-after:always; position:relative; }
            div.header{ position:relative; }
            div.footer{ left:0px; right:0px; bottom:0px; text-align:center; position:absolute; }
    div.container{ position:relative; }
                    div.maintext{ width:100.00%; position:relative; }
                    div.tagged_blob{ width:100.00%; position:relative; }
                                table.iris_table{ width:100.00%; border-collapse:collapse; }
                table.iris_table_header{ width:100.00%; border-collapse:collapse; }
                table.iris_table_footer{ width:100.00%; border-collapse:collapse; }
        div.hr.iris_hr{ width:100.00%; }
            td.total_single{ border-top:thin solid black; }
            td.total_double{ border-top:double black; }
        .ta_10{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_11{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_12{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_13{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_20{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_21{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_22{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_23{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_30{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_31{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_32{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_33{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_40{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_41{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_42{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_43{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_50{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_51{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_52{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_53{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_60{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_61{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_62{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_63{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_70{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_71{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_72{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_73{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_80{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_81{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_82{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_83{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_90{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_91{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_92{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_93{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_100{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_101{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_102{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_103{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_110{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_111{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_112{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_113{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_120{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_121{ color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_122{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:700; }
        .ta_123{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Times New Roman"; font-size:13px; font-weight:400; }
        .ta_130{ color:rgb(0, 0, 0); font-family:"Courier New"; font-size:13px; font-weight:400; }
        .ta_131{ color:rgb(0, 0, 0); font-family:"Courier New"; font-size:13px; font-weight:700; }
        .ta_132{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Courier New"; font-size:13px; font-weight:700; }
        .ta_133{ text-decoration:underline; color:rgb(0, 0, 0); font-family:"Courier New"; font-size:13px; font-weight:400; }
        .ta_140{ color:rgb(0, 0, 0); font-family:"Arial"; font-size:13px; font-weight:400; }
        .ta_141{ color:rgb(0, 0, 0); font-family:"Arial"; font-size:13px; font-weight:400; }
        .ta_142{ color:rgb(0, 0, 0); font-family:"Arial"; font-size:13px; font-weight:400; }
        .ta_143{ color:rgb(0, 0, 0); font-family:"Arial"; font-size:13px; font-weight:400; }
    </style>
</head>
<body xml:lang="en">
    <div style="display:none">
        <ix:header>
            <ix:hidden>
                <ix:nonNumeric contextRef="FY_31_03_2014" name="ns7:NameAuthor" order="1" tupleRef="XBRLDocumentAuthorGrouping_Group45" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL"></ix:nonNumeric>
                <ix:nonNumeric contextRef="FY_31_03_2014" name="ns7:DescriptionOrTitleAuthor" order="2" tupleRef="XBRLDocumentAuthorGrouping_Group45" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL"></ix:nonNumeric>
                <ix:nonNumeric contextRef="FY_31_03_2014" name="ns7:UKCompaniesHouseRegisteredNumber" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">07189486</ix:nonNumeric>
                <ix:nonNumeric contextRef="CountriesHypercube_FY_31_03_2014_Set1" name="ns7:CountryFormationOrIncorporation" format="ixt2:nocontent" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL" />
                <ix:nonNumeric contextRef="CurrenciesHypercube_FY_31_03_2014_Set2" name="ns7:PrincipalCurrencyUsedInBusinessReport" format="ixt2:nocontent" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL" />
                <ix:nonNumeric contextRef="EntityOfficersHypercube_FY_31_03_2014_Set3" name="ns5:NameDirectorSigningAccounts" format="ixt2:nocontent" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL" />
                <ix:nonNumeric contextRef="cfwd_31_03_2014" name="ns7:StartDateForPeriodCoveredByReport" format="ixt2:datedaymonthyear" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">1.4.13</ix:nonNumeric>
                <ix:nonNumeric contextRef="cfwd_31_03_2014" name="ns7:EndDateForPeriodCoveredByReport" format="ixt2:datedaymonthyear" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">31.3.14</ix:nonNumeric>
                <ix:nonNumeric contextRef="cfwd_31_03_2014" name="ns7:BalanceSheetDate" format="ixt2:datedaymonthyear" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">31.3.14</ix:nonNumeric>
                <ix:nonNumeric contextRef="FY_31_03_2014" name="ns7:EntityAccountsType" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">Company accounts</ix:nonNumeric>
                <ix:nonNumeric contextRef="FY_31_03_2014" name="ns7:LegalFormOfEntity" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">Private Limited Company</ix:nonNumeric>
                <ix:nonNumeric contextRef="FY_31_03_2014" name="ns7:DescriptionPeriodCoveredByReport" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">FY</ix:nonNumeric>
                <ix:nonNumeric contextRef="FY_31_03_2014" name="ns7:EntityTrading" format="ixt2:booleantrue" xmlns:ix="http://www.xbrl.org/2008/inlineXBRL">true</ix:nonNumeric>

[stackoverflow limits body text]

I think you need a two-step approach.

  • Use an HTML parser to get to the embedded XML in question
  • ...then use the DOM parser on that content

HTML is not always well-formed XML (unless you're using XHTML, which has become less fashionable). Browsers let lots of things slip, such as missing tags, single vs. double quotes, and attributes without values. This is probably why your document fails to parse.

Many HTML parsers are available.
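
As a rough sketch of the second step, assuming the fragment has already been isolated and is well-formed XML (as the table row in the question is), a namespace-aware DOM parser can pick out the ix:nonFraction elements by namespace URI and copy them into a simple bean. The Fact and InlineXbrlFacts classes and their field names are hypothetical, made up for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical bean holding one ix:nonFraction fact (names are illustrative).
class Fact {
    final String name;
    final String contextRef;
    final String unitRef;
    final String value;

    Fact(String name, String contextRef, String unitRef, String value) {
        this.name = name;
        this.contextRef = contextRef;
        this.unitRef = unitRef;
        this.value = value;
    }
}

class InlineXbrlFacts {
    // Namespace URI taken from the xmlns:ix declaration in the question's file.
    static final String IX_NS = "http://www.xbrl.org/2008/inlineXBRL";

    // Parses well-formed XML text and returns every ix:nonFraction element as a Fact.
    static List<Fact> extract(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // required for getElementsByTagNameNS
            DocumentBuilder db = dbf.newDocumentBuilder();
            Document doc = db.parse(new InputSource(new StringReader(xml)));

            NodeList nodes = doc.getElementsByTagNameNS(IX_NS, "nonFraction");
            List<Fact> facts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                facts.add(new Fact(
                        e.getAttribute("name"),
                        e.getAttribute("contextRef"),
                        e.getAttribute("unitRef"),
                        e.getTextContent().trim()));
            }
            return facts;
        } catch (Exception ex) {
            // Wrap parser configuration, I/O and SAX errors in one unchecked exception.
            throw new IllegalArgumentException("Input is not well-formed XML", ex);
        }
    }

    public static void main(String[] args) {
        // Shortened version of the table row shown in the question.
        String row = "<tr xmlns=\"http://www.w3.org/1999/xhtml\" class=\"iris_table_row\">"
                + "<td><ix:nonFraction contextRef=\"cfwd_31_03_2014\""
                + " name=\"ns5:TangibleFixedAssets\" unitRef=\"GBP\""
                + " xmlns:ix=\"http://www.xbrl.org/2008/inlineXBRL\">7,956</ix:nonFraction></td>"
                + "<td><ix:nonFraction contextRef=\"cfwd_31_03_2013\""
                + " name=\"ns5:TangibleFixedAssets\" unitRef=\"GBP\""
                + " xmlns:ix=\"http://www.xbrl.org/2008/inlineXBRL\">5,402</ix:nonFraction></td>"
                + "</tr>";
        for (Fact f : extract(row)) {
            System.out.println(f.name + " [" + f.contextRef + ", " + f.unitRef + "] = " + f.value);
        }
    }
}
```

One thing worth checking before reaching for a second parser: with the parser bundled in the JDK, printing a Document directly typically shows something like [#document: null], which is only its toString() output and does not by itself mean that parsing failed. Inspecting doc.getDocumentElement() is a more reliable check.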

According to the documentation, DTD validation always takes place, even when you tell it not to!

What you want to do is to create a new DTD that adds your namespace to the standard XHTML DTD. The W3 site discusses how to achieve this, and the example they give is for MathML:

First, define a content model module that instantiates the MathML DTD and connects it to the content model:

<!-- File: mathml-model.mod -->
<!ENTITY % XHTML1-math
     PUBLIC "-//W3C//DTD MathML 2.0//EN"
            "http://www.w3.org/TR/MathML2/dtd/mathml2.dtd" >
%XHTML1-math;

<!ENTITY % Inlspecial.extra 
     "%a.qname; | %img.qname; | %object.qname; | %map.qname; 
      | %Mathml.Math.qname;" >

Next, define a DTD driver that identifies our new content model module as the content model for the DTD, and hands off processing to the XHTML 1.1 driver (for example):

<!-- File: xhtml-mathml.dtd -->
<!ENTITY % xhtml-model.mod
      SYSTEM "mathml-model.mod" >
<!ENTITY % xhtml11.dtd
     PUBLIC "-//W3C//DTD XHTML 1.1//EN"
            "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd" >
%xhtml11.dtd;
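
A different and lighter technique than the custom DTD above, if the only goal is to stop the parser from fetching a DTD named in a DOCTYPE, is to install an EntityResolver that answers every external entity request with empty content. This is only a sketch: the NoDtdFetch class name is hypothetical, and it assumes the document does not rely on entities defined in the DTD (such as &nbsp;), which would then be undefined.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class NoDtdFetch {
    // Parses XML text without loading any external DTD or other external entity.
    static Document parseWithoutDtd(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            dbf.setValidating(false);
            DocumentBuilder db = dbf.newDocumentBuilder();

            // Every external entity (including the DOCTYPE's DTD) resolves to an
            // empty stream, so nothing is read from the network or the file system.
            db.setEntityResolver((publicId, systemId) ->
                    new InputSource(new StringReader("")));

            return db.parse(new InputSource(new StringReader(xml)));
        } catch (Exception ex) {
            // Wrap parser configuration, I/O and SAX errors in one unchecked exception.
            throw new IllegalArgumentException("Input is not well-formed XML", ex);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.1//EN\""
                + " \"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd\">"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body/></html>";
        Document doc = parseWithoutDtd(xhtml);
        System.out.println(doc.getDocumentElement().getLocalName());
    }
}
```

The trade-off is that no DTD-based defaults or entity definitions are applied, so this suits files like the one in the question, where the data of interest is carried entirely by namespaced elements and attributes.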
