
Parsing HTML tag with “:” with lxml

I am new to Python and I'm trying to parse an HTML page with lxml. I want to get the text from <p> tags, but inside one of them I have a strange tag like this:

  <p style="margin-left:0px;padding:0 0 0 0;float:left;">
       <g:plusone size="medium">
       </g:plusone>
      </p>

How can I ignore this tag inside <p>? I want to strip every tag with ":" in its name from any HTML page, because other lxml functions don't work properly with tags like this.

from lxml import etree

parser = etree.HTMLParser()
tree = etree.parse('problemtags.html', parser)
root = tree.getroot()
text = [b.text for b in root.iterfind(".//p")]

I expect to get the text inside the <p> tags, but it fails on fragments like the one above and reports: "b'Tag g:plusone invalid'". All I need is to ignore all incorrect tags like this. I don't know how many such tags I will run into in the future, but I think the problem is the ":", because when I use ".tag" to get the name, it is just "plusone", not "g:plusone".

Here is a way I found to clean up the html:

from io import StringIO
from lxml import etree

s = '''<p style="margin-left:0px;padding:0 0 0 0;float:left;">
   <g:plusone size="medium">
   </g:plusone>
  </p>'''

# The HTML parser recovers from the unknown tag instead of raising.
parser = etree.HTMLParser()
tree = etree.parse(StringIO(s), parser)
# encoding="unicode" makes tostring() return str rather than bytes.
result = etree.tostring(tree.getroot(), pretty_print=True, method="html", encoding="unicode")
print(result)

This prints

<html><body><p style="margin-left:0px;padding:0 0 0 0;float:left;">
       <plusone size="medium">
       </plusone>
      </p></body></html>

To get the root element (an etree._Element) from an etree._ElementTree, just call getroot():

root = tree.getroot()
print(type(root))  # lxml.etree._Element

According to the _Element class documentation, lxml.etree._Element is the class of element instances. In other words, it's what you get back from calling etree.Element, for example:

el = etree.Element("an_etree.Element_reference")
print(type(el))  # lxml.etree._Element
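
As for the original goal of reading the text of each <p>: note that .text only holds the text up to the first child element, so an unknown child tag can appear to swallow the rest. Below is a minimal sketch, assuming a made-up sample page (the strings in `html` are invented for illustration), that uses itertext() to collect the text around the unknown tag. Whether anything is written to parser.error_log, and with what message, may vary with the libxml2 version.

```python
from lxml import etree

# Hypothetical sample page; the text around the widget is invented for illustration.
html = '''<html><body>
<p style="float:left;">Hello <g:plusone size="medium"></g:plusone> world</p>
<p>Second paragraph</p>
</body></html>'''

parser = etree.HTMLParser()  # recovers from unknown tags by default
root = etree.fromstring(html, parser)

# itertext() yields the text of the element and of all its descendants,
# so the unknown child tag does not hide the text that follows it.
texts = ["".join(p.itertext()).strip() for p in root.iterfind(".//p")]
print(texts)

# Parser complaints are collected in the error log rather than raised.
for entry in parser.error_log:
    print(entry.message)
```

This keeps the unknown element in the tree and simply reads past it, which may be enough if the aim is only to extract text.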

The g: is a namespace prefix. The actual tag name is only plusone. So lxml is correct in returning only plusone as the tag name. See a summary of namespaces here.

As I understand it, lxml's HTML parser is not namespace-aware, but the XML parser is. Presumably, given that this HTML document contains XML, it is most likely an XHTML document (if not, then it is probably an invalid HTML document and you cannot expect lxml to parse it correctly). Therefore, you need to run it through the XML parser rather than the HTML parser. lxml's namespace API is explained in their tutorial.

However, with the fragment you provided, the XML parser raises an error:

>>> d = etree.fromstring('''<p style="margin-left:0px;padding:0 0 0 0;float:left;">
...        <g:plusone size="medium">
...        </g:plusone>
...       </p>''')
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "lxml.etree.pyx", line 3032, in lxml.etree.fromstring (src\lxml\lxml.etree.c:68121)
  File "parser.pxi", line 1786, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:102470)
  File "parser.pxi", line 1674, in lxml.etree._parseDoc (src\lxml\lxml.etree.c:101299)
  File "parser.pxi", line 1074, in lxml.etree._BaseParser._parseDoc (src\lxml\lxml.etree.c:96481)
  File "parser.pxi", line 582, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:91290)
  File "parser.pxi", line 683, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:92476)
  File "parser.pxi", line 622, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:91772)
lxml.etree.XMLSyntaxError: Namespace prefix g on plusone is not defined, line 2, column 23

Note that it complains that the "Namespace prefix g on plusone is not defined". Presumably, the namespace prefix is defined elsewhere in your document. As I don't know what it is, I'll just make something up and define it on the plusone tag in your fragment:

>>> d = etree.fromstring('''<p style="margin-left:0px;padding:0 0 0 0;float:left;">
...        <g:plusone xmlns:g="something" size="medium">
...        </g:plusone>
...       </p>''')
>>> d
<Element p at 0x2563cd8>
>>> d.tag
'p'
>>> d[0]
<Element {something}plusone at 0x2563940>
>>> d[0].tag
'{something}plusone' 

Notice that the g: prefix was replaced with the actual namespace ({something} in this case, because I set it with xmlns:g="something"). Usually the namespace would be a URI, so you may find that your tag looks something like this: {http://where.it/is/from.xml}plusone
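
To look such an element up again, you can either write the {namespace}name form directly or pass a prefix map to find(). A minimal sketch, reusing the made-up "something" namespace from the session above (it is a placeholder, not a real URI; `fragment`, `by_clark` and `by_prefix` are names chosen here for illustration):

```python
from lxml import etree

# "something" is the placeholder namespace from the session above, not a real URI.
fragment = '''<p style="margin-left:0px;padding:0 0 0 0;float:left;">
  <g:plusone xmlns:g="something" size="medium"></g:plusone>
</p>'''
d = etree.fromstring(fragment)

# Option 1: spell out the namespace in {uri}name form.
by_clark = d.find("{something}plusone")

# Option 2: map a prefix of your choice to the same namespace.
by_prefix = d.find("g:plusone", namespaces={"g": "something"})

print(by_clark.tag)
print(by_prefix.get("size"))
```

The prefix in the map only has to match the one used in the search path. It does not need to be the prefix that appears in the document.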

Nevertheless, I find working with namespaces rather bothersome when they are not necessary. You may actually find it easier to use the HTML parser, which ignores the namespaces. Now that you know that the tag is named plusone, not g:plusone, you may be able to get on with your work using just the HTML parser.
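
If you only care about the local name, a small helper can normalise whatever form .tag takes. This is a sketch, not part of lxml: `local_name` is a hypothetical function, and the prefix:name form is included only on the assumption that some parser setups might leave the prefix in place.

```python
def local_name(tag):
    """Return the tag name without namespace or prefix (hypothetical helper)."""
    if not isinstance(tag, str):
        return None  # comments and processing instructions have non-string tags
    if tag.startswith("{"):
        tag = tag.split("}", 1)[1]  # '{uri}name' -> 'name'
    return tag.rsplit(":", 1)[-1]   # 'prefix:name' -> 'name'

for tag in ("plusone", "g:plusone", "{something}plusone"):
    print(local_name(tag))
```

With this, a comparison such as local_name(el.tag) == "plusone" works the same way whichever parser produced the element.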
