Reading an XML file with lxml raises an EntityRef error

I use lxml to read an XML file that has a structure like the one below:

    <domain>http://www.trademe.co.nz</domain>         
    <start>http://www.trademe.co.nz/Browse/CategoryAttributeSearchResults.aspx?search=1&cid=5748&sidebar=1&rptpath=350-5748-4233-&132=FLAT&134=&153=&29=&122=0&122=0&59=0&59=0&178=0&178=0&sidebarSearch_keypresses=0&sidebarSearch_suggested=0</start>

And my Python code is:

from lxml import etree

tree = etree.parse('metaWeb.xml') 

When I run it, I get the error "EntityRef: expecting ';'".

However, when I remove the & symbols from the XML file, everything is fine.

How can I solve this error?

The problem is that this isn't valid (well-formed) XML. In XML, an & symbol always starts an entity reference, like &#1234; for the character U+04D2 (aka Ӓ), &quot; for the character ", or some custom entity defined in your document/DTD/schema.*

If you want to put a literal & into a string, you have to replace it with something else, typically &amp;, which is a character entity reference for the ampersand character.

So, if you're sure there are no actual entity references in your document, just un-escaped ampersands, you can fix it pretty simply:

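with open('metaWeb.xml', 'rb') as f:
    # Read bytes, so an encoding declaration in the file (if any) is still accepted
    xml = f.read().replace(b'&', b'&amp;')
# fromstring() returns the root element rather than an ElementTree
root = etree.fromstring(xml)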
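
If the file might already contain some correctly escaped references, the blanket replace above would turn &amp; into &amp;amp;. A rough sketch of a more careful variant follows; it only escapes an & that does not look like the start of a reference. The function name escape_bare_ampersands and the example.com URL are made up for illustration, and the pattern is a heuristic: it will also leave alone a query parameter that happens to look like a reference (say, &copy;), and it does not know about CDATA sections.

```python
import re

# Matches an '&' that does NOT start a named (&name;), decimal (&#1234;)
# or hexadecimal (&#x4D2;) reference.
_BARE_AMP = re.compile(
    r'&(?![A-Za-z_][A-Za-z0-9._-]*;|#[0-9]+;|#x[0-9A-Fa-f]+;)'
)


def escape_bare_ampersands(text):
    """Replace only the unescaped '&' characters in text with '&amp;'."""
    return _BARE_AMP.sub('&amp;', text)


# Hypothetical input mixing bare ampersands with one that is already escaped.
broken = '<start>http://example.com/?search=1&cid=5748&amp;x=&132=FLAT</start>'
fixed = escape_bare_ampersands(broken)
print(fixed)
```

The resulting string can then be passed to etree.fromstring() in place of the output of the plain replace.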

However, a better solution, if possible, is to fix whatever program is generating this incorrect XML.
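
As a minimal sketch of what that fix could look like, assuming the generating program is also written in Python and builds the file by string concatenation: the standard library's xml.sax.saxutils.escape() converts &, < and > into &amp;, &lt; and &gt; before the text is written out. The helper make_start_element and the example.com URL are hypothetical.

```python
from xml.sax.saxutils import escape


def make_start_element(url):
    """Wrap url in a <start> element, escaping XML special characters."""
    # escape() turns '&', '<' and '>' into '&amp;', '&lt;' and '&gt;'
    return '<start>' + escape(url) + '</start>'


line = make_start_element('http://example.com/search?cid=5748&sidebar=1')
print(line)
```

If the generator builds its output with lxml instead (setting an element's .text and serializing with etree.tostring()), the escaping happens automatically and no manual step is needed.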


* This is slightly misleading and not quite true: a numeric character reference is not actually an entity reference. Also, a character entity reference like &quot; or &amp; works the same as any other reference with replacement text; those entities just happen to be predefined (by the XML specification itself, or by the DTDs for HTML). But lxml, like most XML software, uses the term "entity reference" slightly more broadly than the standard.

Replace & with &amp; in your XML file; otherwise your XML does not comply with the XML standard.
