Parsing an HTML table into a Python dictionary using BeautifulSoup

Question

This is the HTML I'm trying to parse with BeautifulSoup:

<table>
          <tr>
            <th width="100">menu1</th>
            <td>
              <ul class="classno1" style="margin-bottom:10;">
                    <li>Some data1</li>
                    <li>Foo1<a href="/link/to/bar1">Bar1</a></li>
                    ... (the number of these tags isn't fixed)
              </ul>
            </td>
          </tr>
          <tr>
            <th width="100">menu2</th>
            <td>
              <ul class="classno1" style="margin-bottom:10;">
                    <li>Some data2</li>
                    <li>Foo2<a href="/link/to/bar2">Bar2</a></li>
                    <li>Foo3<a href="/link/to/bar3">Bar3</a></li>
                    <li>Some data3</li>
                    ... (the number of these tags isn't fixed either)
              </ul>
            </td>
          </tr>
</table>

The output I would like to get is a dictionary like this:

DICT = {
    'menu1': ['Some data1','Foo1 Bar1'],
    'menu2': ['Some data2','Foo2 Bar2','Foo3 Bar3','Some data3'],
}

As I already mentioned in the code, the number of <li> tags is not fixed. Additionally, there could be:

menu1 and menu2

just menu1

just menu2

neither menu1 nor menu2 (just <table></table>)

So, for example, it could look just like this:

<table>
          <tr>
            <th width="100">menu1</th>
            <td>
              <ul class="classno1" style="margin-bottom:10;">
                    <li>Some data1</li>
                    <li>Foo1<a href="/link/to/bar1">Bar1</a></li>
                    ... (the number of these tags isn't fixed)
              </ul>
            </td>
          </tr>
</table>

I tried to use this example, but with no success. I think it's because of the <ul> tags: I can't read the proper data from the table. Another problem for me is the variable number of menus and <li> tags. So my question is: how do I parse this particular table into a Python dictionary? I should mention that I have already parsed some simple data with the .text attribute of the BeautifulSoup handler, so it would be nice if I could keep this part as is:

 request = c.get('http://example.com/somepage.html')
 soup = bs(request.text)

and this is always the first table on the page, so I can get it with:

 table = soup.find_all('table')[0]

Thank you in advance for any help.

Answer 1

html = """<table>
          <tr>
            <th width="100">menu1</th>
            <td>
              <ul class="classno1" style="margin-bottom:10;">
                    <li>Some data1</li>
                    <li>Foo1<a href="/link/to/bar1">Bar1</a></li>
              </ul>
            </td>
          </tr>
          <tr>
            <th width="100">menu2</th>
            <td>
              <ul class="classno1" style="margin-bottom:10;">
                    <li>Some data2</li>
                    <li>Foo2<a href="/link/to/bar2">Bar2</a></li>
                    <li>Foo3<a href="/link/to/bar3">Bar3</a></li>
                    <li>Some data3</li>
              </ul>
            </td>
          </tr>
</table>"""

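# Python 2 with BeautifulSoup 3 (the old `BeautifulSoup` module)
import BeautifulSoup as bs

soup = bs.BeautifulSoup(html)

# The menu table is always the first table on the page
table = soup.findAll('table')[0]

results = {}

# Every <th> holds one menu name
th = table.findChildren('th')

for x in th:
    results_li = []
    # Skip the whitespace text node after </th> to reach the sibling <td>
    li = x.nextSibling.nextSibling.findChildren('li')
    for y in li:
        # y.next is the first text node inside <li>;
        # text inside a nested <a> is not included
        results_li.append(y.next)
    # x.next is the text node inside <th>, i.e. the menu name
    results[x.next] = results_li

print results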

Output:

{
    u'menu2': [u'Some data2', u'Foo2', u'Foo3', u'Some data3'], 
    u'menu1': [u'Some data1', u'Foo1']
}
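Note that the output above contains 'Foo1' rather than the 'Foo1 Bar1' asked for in the question, because only the first text node of each <li> is read. Below is a minimal sketch of the same idea for Python 3, assuming BeautifulSoup 4 (the `bs4` package) is installed; it is one possible adaptation, not the code from the answer. The names `table_to_dict` and `HTML` are hypothetical and chosen only for illustration. It walks each <tr>, uses the <th> text as the key, and joins all text pieces inside each <li> with a space.

```python
from bs4 import BeautifulSoup

# Sample markup taken from the question (two menus).
HTML = """<table>
  <tr>
    <th width="100">menu1</th>
    <td>
      <ul class="classno1" style="margin-bottom:10;">
        <li>Some data1</li>
        <li>Foo1<a href="/link/to/bar1">Bar1</a></li>
      </ul>
    </td>
  </tr>
  <tr>
    <th width="100">menu2</th>
    <td>
      <ul class="classno1" style="margin-bottom:10;">
        <li>Some data2</li>
        <li>Foo2<a href="/link/to/bar2">Bar2</a></li>
        <li>Foo3<a href="/link/to/bar3">Bar3</a></li>
        <li>Some data3</li>
      </ul>
    </td>
  </tr>
</table>"""


def table_to_dict(html):
    """Map each <th> label to the list of <li> texts found in the same row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")  # first table on the page, or None
    results = {}
    if table is None:
        return results
    for row in table.find_all("tr"):
        header = row.find("th")
        if header is None:
            continue  # skip rows without a menu name
        key = header.get_text(strip=True)
        # get_text(" ", strip=True) joins every text piece inside the <li>,
        # including the text of nested <a> tags, with a single space.
        results[key] = [
            li.get_text(" ", strip=True) for li in row.find_all("li")
        ]
    return results


if __name__ == "__main__":
    print(table_to_dict(HTML))
    # {'menu1': ['Some data1', 'Foo1 Bar1'],
    #  'menu2': ['Some data2', 'Foo2 Bar2', 'Foo3 Bar3', 'Some data3']}
```

Because the loop simply iterates over whatever rows exist, a table with only one menu yields a one-key dictionary and an empty <table></table> yields {}, which should cover the variants listed in the question.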

Answer 1: accepted, 2014-06-23 02:46:02
