I use lxml to read an XML file with a structure like the one below:
<domain>http://www.trademe.co.nz</domain>
<start>http://www.trademe.co.nz/Browse/CategoryAttributeSearchResults.aspx?search=1&cid=5748&sidebar=1&rptpath=350-5748-4233-&132=FLAT&134=&153=&29=&122=0&122=0&59=0&59=0&178=0&178=0&sidebarSearch_keypresses=0&sidebarSearch_suggested=0</start>
My Python code is:
from lxml import etree
tree = etree.parse('metaWeb.xml')
When I run it, I get the error entityref: expecting ';'
However, when I remove the & symbols from the XML file, everything is fine.
How can I solve this error?
The problem is that this isn't valid XML. In XML, a & symbol always starts an entity reference, like &#1234; for the character U+04D2 (aka Ӓ), &quot; for the character ", or some custom entity defined in your document/DTD/schema.*
If you want to put a literal & into a string, you have to replace it with something else, typically &amp;, which is a character entity reference for the ampersand character.
So, if you're sure there are no actual entity references in your document, just un-escaped ampersands, you can fix it pretty simply:
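with open('metaWeb.xml') as f:
    # Escape every bare ampersand so the parser sees &amp; instead of &
    xml = f.read().replace('&', '&amp;')
# Note: fromstring returns the root element rather than an ElementTree
tree = etree.fromstring(xml)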
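If the file might contain a mix of bare ampersands and references that are already valid, a blanket replace would turn &amp; into &amp;amp;. A rough sketch of a more careful variant follows; the helper name escape_bare_ampersands and the regular expression are illustrative assumptions, not part of lxml, and the pattern only approximates the XML grammar for references:

```python
import re

# Matches an ampersand that does NOT begin something shaped like a reference:
#   &#1234;   decimal character reference
#   &#x4D2;   hexadecimal character reference
#   &name;    named entity reference
# This is an approximation of the XML rules, not a full implementation.
_BARE_AMP = re.compile(r'&(?!(?:#\d+|#x[0-9A-Fa-f]+|[A-Za-z_:][\w.\-:]*);)')


def escape_bare_ampersands(text):
    """Hypothetical helper: escape only ampersands that are not references."""
    return _BARE_AMP.sub('&amp;', text)


if __name__ == '__main__':
    sample = '<start>http://example.com/?search=1&cid=5748&amp;x=1</start>'
    print(escape_bare_ampersands(sample))
```

The result can then be passed to etree.fromstring as before. Text that merely looks like a reference (for example a query parameter written as &copy;) is left untouched, so this is a workaround rather than a guarantee.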
However, a better solution, if possible, is to fix whatever program is generating this incorrect XML.
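As a sketch of what that fix could look like if the generating program is written in Python and builds the XML by string concatenation, the standard library's xml.sax.saxutils.escape can be applied to the text before it is written. The function name make_start_element and the example URL are assumptions made for illustration:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def make_start_element(url):
    """Hypothetical generator step: wrap a URL in <start>, escaping &, < and >."""
    return '<start>' + escape(url) + '</start>'


if __name__ == '__main__':
    # Example URL is made up; any text containing & behaves the same way.
    url = 'http://example.com/search?search=1&cid=5748&sidebar=1'
    xml = make_start_element(url)
    print(xml)
    # Parsing the escaped output gives back the original URL text.
    assert ET.fromstring(xml).text == url
```

Building the document with an XML library (for example lxml's own element API) instead of string concatenation avoids the problem as well, since the serializer does the escaping.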
* This is slightly misleading; a numeric character reference is not actually an entity reference. Also, a character entity reference like &quot; or &amp; is the same as any other reference with replacement text; the entities just happen to be implicitly defined by the XML/HTML base DTDs. But lxml, like most XML software, uses the term "entity reference" slightly more broadly than the standard.
Replace & with &amp; in your XML file; otherwise your XML does not comply with the XML standard.