使用Python 2.7的HTML解析树

Question

I was trying to get configure one parse tree for the below HTML table,but couldn't form it.I want to see how the tree structure looks like!can anyone help me here? 我试图为下面的HTML表配置一个解析树，但是无法形成它。我想看看树结构是什么样的！有人可以帮助我吗？

# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link2">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

EDIT 编辑

Microsoft Windows [Version 6.1.7600]
Copyright (c) 2009 Microsoft Corporation.  All rights reserved.

C:\Users\matt>easy_install ete2
Searching for ete2
Reading http://pypi.python.org/simple/ete2/
Reading http://ete.cgenomics.org
Reading http://ete.cgenomics.org/releases/ete2/
Reading http://ete.cgenomics.org/releases/ete2
Best match: ete2 2.1rev539
Downloading http://ete.cgenomics.org/releases/ete2/ete2-2.1rev539.tar.gz
Processing ete2-2.1rev539.tar.gz
Running ete2-2.1rev539\setup.py -q bdist_egg --dist-dir c:\users\arupra~1\appdat
a\local\temp\easy_install-sypg3x\ete2-2.1rev539\egg-dist-tmp-zemohm

Installing ETE (A python Environment for Tree Exploration).

Checking dependencies...
numpy cannot be found in your python installation.
Numpy is required for the ArrayTable and ClusterTree classes.
MySQLdb cannot be found in your python installation.
MySQLdb is required for the PhylomeDB access API.
PyQt4 cannot be found in your python installation.
PyQt4 is required for tree visualization and image rendering.
lxml cannot be found in your python installation.
lxml is required from Nexml and Phyloxml support.

However, you can still install ETE without such functionality.
Do you want to continue with the installation anyway? [y,n]y
Your installation ID is: d33ba3b425728e95c47cdd98acda202f
warning: no files found matching '*' under directory '.'
warning: no files found matching '*.*' under directory '.'
warning: manifest_maker: MANIFEST.in, line 4: path 'doc/ete_guide/' cannot end w
ith '/'

warning: manifest_maker: MANIFEST.in, line 5: path 'doc/' cannot end with '/'

warning: no previously-included files matching '*.pyc' found under directory '.'

zip_safe flag not set; analyzing archive contents...
Adding ete2 2.1rev539 to easy-install.pth file
Installing ete2 script to C:\Python27\Scripts

Installed c:\python27\lib\site-packages\ete2-2.1rev539-py2.7.egg
Processing dependencies for ete2
Finished processing dependencies for ete2

Answer 1

This answer comes a bit late, but still I'd like to share it: 这个答案有点晚了，但我还想分享一下： 在此输入图像描述

I used networkx and lxml (which I found to allow much more elegant traversal of the DOM-tree). 我使用了networkx和lxml （我发现它允许对DOM树进行更优雅的遍历）。 However, the tree-layout depends on graphviz and pygraphviz installed. 但是，树形布局取决于安装的graphviz和pygraphviz 。 networkx itself would just distribute the nodes somehow on the canvas. networkx本身只是在画布上以某种方式分发节点。 The code actually is longer than required cause I draw the labels myself to have them boxed (networkx provides for drawing the labels but it doesn't pass on the bbox keyword to matplotlib). 代码实际上比需要的时间长，因为我自己绘制标签以将它们装箱（networkx提供绘制标签，但它不会将bbox关键字传递给matplotlib）。

import networkx as nx
from lxml import html
import matplotlib.pyplot as plt
from networkx.drawing.nx_agraph import graphviz_layout

raw = "...your raw html"

def traverse(parent, graph, labels):
    labels[parent] = parent.tag
    for node in parent.getchildren():
        graph.add_edge(parent, node)
        traverse(node, graph, labels)

G = nx.DiGraph()
labels = {}     # needed to map from node to tag
html_tag = html.document_fromstring(raw)
traverse(html_tag, G, labels)

pos = graphviz_layout(G, prog='dot')

label_props = {'size': 16,
               'color': 'black',
               'weight': 'bold',
               'horizontalalignment': 'center',
               'verticalalignment': 'center',
               'clip_on': True}
bbox_props = {'boxstyle': "round, pad=0.2",
              'fc': "grey",
              'ec': "b",
              'lw': 1.5}

nx.draw_networkx_edges(G, pos, arrows=True)
ax = plt.gca()

for node, label in labels.items():
        x, y = pos[node]
        ax.text(x, y, label,
                bbox=bbox_props,
                **label_props)

ax.xaxis.set_visible(False)
ax.yaxis.set_visible(False)
plt.show()

Changes to the code if you prefer (or have) to use BeautifulSoup: 如果您更喜欢（或有）使用BeautifulSoup，请更改代码：

I'm no expert... just looked at BS4 for the first time,... but it works: 我不是专家......只是第一次看BS4，但是它有效：

#from lxml import html
from bs4 import BeautifulSoup
from bs4.element import NavigableString

...

def traverse(parent, graph, labels):
    labels[hash(parent)] = parent.name
    for node in parent.children:
        if isinstance(node, NavigableString):
            continue
        graph.add_edge(hash(parent), hash(node))
        traverse(node, graph, labels)

...

#html_tag = html.document_fromstring(raw)
soup = BeautifulSoup(raw)
html_tag = next(soup.children)

...

Answer 2

Python modules: Python模块：
1. ETE , but it requires Newick format data. 1. ETE ，但它需要Newick格式数据。
2. GraphViz + pydot . 2. GraphViz + pydot 。 See this SO answer . 看到这个答案。

Javascript: 使用Javascript：
The amazing d3 TreeLayout which uses JSON format. 令人惊叹的d3 TreeLayout使用JSON格式。

If you're using ETE then you'll need to convert html to newick format. 如果您正在使用ETE，则需要将html转换为newick格式。 Here's a small example I made: 这是我做的一个小例子：

from lxml import html
from urllib import urlopen


def getStringFromNode(node):
    # Customize this according to
    # your requirements.
    node_string = node.tag
    if node.get('id'):
        node_string += '-' + node.get('id')
    if node.get('class'):
        node_string += '-' + node.get('class')
    return node_string


def xmlToNewick(node):
    node_string = getStringFromNode(node)
    nwk_children = []
    for child in node.iterchildren():
        nwk_children.append(xmlToNewick(child))
    if nwk_children:
        return "(%s)%s" % (','.join(nwk_children), node_string)
    else:
        return node_string


def main():
    html_page = html.fromstring(urlopen('http://www.google.co.in').read())
    newick_page = xmlToNewick(html_page)
    return newick_page

main()

Output ( http://www.google.co.in in newick format): 输出（ http://www.google.co.in inickick格式）：

'((meta,title,script,style,style,script)head,(script,textarea-csi,(((b-gb1,a-gb1,a-gb1,a-gb1,a-gb1,a-gb1,a-gb1,a-gb1,(u)a-gb1)nobr)div-gbar,((span-gbn-gbi,span-gbf-gbf,span-gbe,a-gb4,a-gb4,a-gb_70-gb4)nobr)div-guser,div-gbh,div-gbh)div-mngb,(br-lgpd,(((div)div-hplogo)div,br)div-lga,(((td,(input,input,input,(input-lst)div-ds,br,((input-lsb)span-lsbb)span-ds,((input-lsb)span-lsbb)span-ds)td,(a,a)td-fl sblc)tr)table,input-gbv)form,div-gac_scont,(br,((a,a,a,a,a,a,a,a,a)font-addlang,br,br)div-als)div,(((a,a,a,a,a-fehl)div-fll)div,(a)p)span-footer)center,div-xjsd,(script)div-xjsi,script)body)html'

After that you can use ETE as showen in there examples. 之后，您可以在示例中使用ETE。

Hope that helps. 希望有所帮助。

使用Python 2.7的HTML解析树

问题描述

2 个解决方案

解决方案1
12 已采纳 2013-01-06 19:45:17

Changes to the code if you prefer (or have) to use BeautifulSoup: 如果您更喜欢（或有）使用BeautifulSoup，请更改代码：

解决方案2
6 2013-01-06 13:08:08

使用Python 2.7的HTML解析树

问题描述

2 个解决方案

解决方案1 12 已采纳 2013-01-06 19:45:17

Changes to the code if you prefer (or have) to use BeautifulSoup: 如果您更喜欢（或有）使用BeautifulSoup，请更改代码：

解决方案2 6 2013-01-06 13:08:08

解决方案1
12 已采纳 2013-01-06 19:45:17

解决方案2
6 2013-01-06 13:08:08