
lxml: convert an Element to an ElementTree

The following test reads a file and uses lxml.html to generate the leaf nodes of the DOM/graph for the page.

However, I'm also trying to figure out how to get the input from a string. Using

 lxml.html.fromstring(s)

doesn't work, as this generates an Element as opposed to an ElementTree.

So, I'm trying to figure out how to convert an Element to an ElementTree.

Thoughts?

Test code:

import subprocess
import lxml.html
from lxml import etree    # trying this to see if it is needed
                          # to convert from Element to ElementTree

#cmd = 'cat osu_test.txt'
cmd = 'cat o2.txt'
proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
s = proc.communicate()[0].strip()

# s contains HTML, not XML text
#doc = lxml.html.parse(s)
doc = lxml.html.parse('osu_test.txt')
doc1 = lxml.html.fromstring(s)

# print every leaf node (no children) with its path
for node in doc.iter():
    if len(node) == 0:
        print "aaa ", node.tag, doc.getpath(node)
        #print "aaa ", node.tag

nt = etree.ElementTree(doc1)        # <<<<< doesn't work.. so what will??
for node in nt.iter():
    if len(node) == 0:
        print "aaa ", node.tag, doc.getpath(node)
        #print "aaa ", node.tag

===============================

Update:

(Parsing HTML instead of XML.) Added the changes suggested by Abbas and got the following errors:

    doc1 = etree.fromstring(s)
  File "lxml.etree.pyx", line 2532, in lxml.etree.fromstring (src/lxml/lxml.etree.c:48621)
  File "parser.pxi", line 1545, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:72232)
  File "parser.pxi", line 1424, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:71093)
  File "parser.pxi", line 938, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:67862)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:64244)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:65165)
  File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64508)
lxml.etree.XMLSyntaxError: Entity 'nbsp' not defined, line 48, column 220

Update:

Managed to get the test working. I'm not exactly sure why. If someone with Python chops wants to provide an explanation, that would help future readers who stumble on this.

from cStringIO import StringIO
from lxml.html import parse

doc1 = parse(StringIO(s))

for node in doc1.iter():
    if len(node) == 0:
        print "aaa ", node.tag, doc1.getpath(node)

It appears that the StringIO class implements the I/O functionality that parse needs in order to process the input string for the test HTML, perhaps similar to what casting provides in other languages.

Thanks.
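On the StringIO observation above: lxml's parse functions take a filename, a URL, or a file-like object, so a plain string is treated as a location to open rather than as markup. Wrapping the data in an in-memory file object gives parse something it can read from. The following is a minimal sketch in Python 3 syntax (where io.BytesIO takes the place of cStringIO); the sample markup is a hypothetical stand-in for the test file, which is not shown in the question.

```python
from io import BytesIO

import lxml.html

# Hypothetical sample input standing in for the contents of the test file.
html_bytes = b"<html><body><div><p>one</p><p>two</p></div></body></html>"

# parse() wants a filename, URL, or file-like object, so the in-memory
# bytes are wrapped in a file-like BytesIO before being handed over.
tree = lxml.html.parse(BytesIO(html_bytes))

# parse() returns an ElementTree, so getpath() is available directly.
leaf_paths = [tree.getpath(node) for node in tree.iter() if len(node) == 0]
print(leaf_paths)
```

This is closer to "giving the parser something that looks like an open file" than to a type cast: the data is unchanged, only the interface it is offered through differs.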

To get the root tree from an _Element (generated with lxml.html.fromstring), you can use the getroottree method:

doc = lxml.html.fromstring(s)
tree = doc.getroottree()
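As a rough end-to-end sketch of the getroottree approach applied to the leaf-node loop from the question (Python 3 syntax; the markup is a made-up sample, not the asker's file):

```python
import lxml.html

# Hypothetical sample input; the original test file is not shown.
html_text = "<html><body><div><p>one</p><span>two</span></div></body></html>"

root = lxml.html.fromstring(html_text)  # an element, not a tree
tree = root.getroottree()               # the ElementTree that owns the element

# Collect (tag, path) for every leaf element, i.e. elements without children.
leaf_paths = [(node.tag, tree.getpath(node))
              for node in tree.iter() if len(node) == 0]
print(leaf_paths)
```

Note that getpath is called on the same tree the nodes come from, which is what the loop needs.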

The etree.fromstring method parses an XML string and returns a root element. The etree.ElementTree class is a tree wrapper around an element and as such requires an element for instantiation.

Therefore, passing the root element to the etree.ElementTree() constructor should give you what you want:

root = etree.fromstring(s)
nt = etree.ElementTree(root)
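Because etree.fromstring uses an XML parser by default, HTML input containing entities such as &nbsp; triggers the XMLSyntaxError shown in the question. One possible way around it, sketched here in Python 3 syntax with a hypothetical sample string, is to pass an etree.HTMLParser to fromstring and then wrap the result:

```python
from lxml import etree

# Hypothetical sample input; the original test file is not shown.
html_text = "<html><body><p>a&nbsp;b</p></body></html>"

# The default XML parser rejects &nbsp;, which is not a predefined XML entity.
try:
    etree.fromstring(html_text)
    xml_failed = False
except etree.XMLSyntaxError:
    xml_failed = True

# An HTML parser resolves the entity; the root can then be wrapped in a tree.
root = etree.fromstring(html_text, parser=etree.HTMLParser())
nt = etree.ElementTree(root)

paragraph = root.find("body/p")
print(xml_failed, nt.getpath(paragraph))
```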

An _Element, such as the one returned by a call like:

tree = etree.HTML(result.read(), etree.HTMLParser())

can be made into an _ElementTree like so:

tree = tree.getroottree()  # convert _Element to _ElementTree
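A small self-contained check of the two types involved, assuming a literal byte string in place of result.read() (the response object in the snippet above is not defined in this thread):

```python
from lxml import etree

# Hypothetical page content standing in for result.read().
page = b"<html><body><p>hi</p></body></html>"

element = etree.HTML(page, etree.HTMLParser())  # root element of the document
tree = element.getroottree()                    # tree wrapper for that document

# Record the class names to make the Element / ElementTree distinction visible.
kinds = (type(element).__name__, type(tree).__name__)
print(kinds)
```

Keeping the element and the tree in separate variables, rather than reusing one name for both, makes it easier to see which object offers getpath.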

Hope that's what you expect.
