
Extract HTML from XML

I want to extract an HTML page from an XML file. Any ideas, please?

 <?xml ....>
      <first>
      </first>

         <second>
         </second>
      <xhtml>
          <html>
              .....some html code here
          </html>
      </xhtml>

I want to extract the HTML page exactly as it is from the XML above.

Because XML and HTML markup are similar, an XML parser may have trouble with HTML embedded directly in the file, especially if the HTML is not well-formed XML. I would suggest encoding (escaping) the HTML when you save it in the XML file, so that the parser treats it as plain text. Then, when you read the data back from the XML, you just need to decode it for use.

<?xml ....?>
<first></first>
<second></second>
<markup>
    &lt;html&gt;
        code here
    &lt;/html&gt;
</markup>

When you decode the markup section, it will look like this:

<html>
    code here
</html>
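
The encode/decode round trip described above can be sketched in Python with the standard library. This is only an illustration under assumptions: the wrapper element `<root>` and the helper names `wrap_html` and `unwrap_html` are hypothetical and not part of the original answer. Note that the XML parser decodes the entities itself, so no separate decoding step is needed after parsing.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def wrap_html(html_text: str) -> str:
    """Store an HTML string as escaped text inside a <markup> element.

    The <root> wrapper is an assumption; XML needs a single root element.
    """
    # escape() turns &, < and > into &amp;, &lt; and &gt;
    return f"<root><first/><second/><markup>{escape(html_text)}</markup></root>"


def unwrap_html(xml_text: str) -> str:
    """Parse the XML and return the original HTML string."""
    root = ET.fromstring(xml_text)
    # The parser decodes the entities, so .text is already plain HTML.
    return root.find("markup").text


# The unclosed <br> would break an XML parser if it were embedded unescaped.
original = "<html><body><p>Tom & Jerry<br></p></body></html>"
xml_text = wrap_html(original)
recovered = unwrap_html(xml_text)
print(recovered)
```

Because the HTML travels as text, it does not have to be well-formed XML for this to work.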

You might find this of some use:

http://www.w3schools.com/xml/xml_parser.asp

You can extract the HTML from the XML using JavaScript, then create an element on your HTML page and insert the extracted HTML into it. The one catch is that the XML data you're receiving appears to contain a full html element.

If you want to add the content to an existing page, you would first have to strip the html and body tags.
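
The same idea (pull out the html element, then drop the html and body wrappers) can be sketched in Python with the standard library's ElementTree. This rests on two assumptions that the question does not confirm: the file has a single root element (called `<root>` here, a hypothetical name), and the embedded HTML is well-formed XML. The sample page content is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the question's layout wrapped in one root element.
xml_text = """<root>
  <first></first>
  <second></second>
  <xhtml>
    <html>
      <head><title>Demo</title></head>
      <body><h1>Hello</h1><p>Some text</p></body>
    </html>
  </xhtml>
</root>"""

root = ET.fromstring(xml_text)
html_el = root.find("xhtml/html")

# The whole page, serialized back to markup.
page = ET.tostring(html_el, encoding="unicode", method="html")

# Only the children of <body>, i.e. the page without its html and body tags.
# Text placed directly inside <body> (body.text) is ignored in this sketch.
body = html_el.find("body")
fragment = "".join(
    ET.tostring(child, encoding="unicode", method="html") for child in body
)

print(page.strip())
print(fragment)
```

If the embedded HTML is not well-formed XML, `ET.fromstring` raises a `ParseError`, which is the situation the encoding approach is meant to avoid.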

If you use Python, the extraction can be very easy.

from simplified_scrapy.simplified_doc import SimplifiedDoc

# Sample XML from the question
xml = '''
 <?xml >
    <first>
    </first>
        <second>
        </second>
    <xhtml>
        <html>
            .....some html code here
        </html>
    </xhtml>
'''
doc = SimplifiedDoc(xml)
# Pull the HTML out of the <xhtml> element
html = doc.xhtml.html
print(html)

Before running this, you need to install simplified_scrapy using pip:

pip install simplified_scrapy
