
Parsing an HTML table with BeautifulSoup into a Python dictionary

This is the HTML that I'm trying to parse with BeautifulSoup:

<table>
          <tr>
            <th width="100">menu1</th>
            <td>
              <ul class="classno1" style="margin-bottom:10;">
                    <li>Some data1</li>
                    <li>Foo1<a href="/link/to/bar1">Bar1</a></li>
                    ... (amount of this tags isn't fixed)
              </ul>
            </td>
          </tr>
          <tr>
            <th width="100">menu2</th>
            <td>
              <ul class="classno1" style="margin-bottom:10;">
                    <li>Some data2</li>
                    <li>Foo2<a href="/link/to/bar2">Bar2</a></li>
                    <li>Foo3<a href="/link/to/bar3">Bar3</a></li>
                    <li>Some data3</li>
                    ... (amount of this tags isn't fixed too)
              </ul>
            </td>
          </tr>
</table>

The output I would like to get is a dictionary like this:

DICT = {
    'menu1': ['Some data1','Foo1 Bar1'],
    'menu2': ['Some data2','Foo2 Bar2','Foo3 Bar3','Some data3'],
}

As I already mentioned in the code, the number of <li> tags is not fixed. Additionally, there could be:

  • menu1 and menu2
  • just menu1
  • just menu2
  • neither menu1 nor menu2 (just <table></table>)

    So, for example, it could look just like this:

    <table>
      <tr>
        <th width="100">menu1</th>
        <td>
          <ul class="classno1" style="margin-bottom:10;">
            <li>Some data1</li>
            <li>Foo1<a href="/link/to/bar1">Bar1</a></li>
            ... (amount of this tags isn't fixed)
          </ul>
        </td>
      </tr>
    </table>

    I tried to use this example, but with no success. I think the <ul> tags are the reason I can't read the data from the table properly. The variable number of menus and <li> tags is also a problem for me. So my question is: how do I parse this particular table into a Python dictionary?

    I should mention that I have already parsed some simple data with the .text attribute of the BeautifulSoup handler, so it would be nice if I could just keep it as is.

     request = c.get('http://example.com/somepage.html')
     soup = bs(request.text)

    This is always the first table on the page, so I can get it with:

     table = soup.find_all('table')[0] 

    Thank you in advance for any help.

  • html = """<table>
              <tr>
                <th width="100">menu1</th>
                <td>
                  <ul class="classno1" style="margin-bottom:10;">
                        <li>Some data1</li>
                        <li>Foo1<a href="/link/to/bar1">Bar1</a></li>
                  </ul>
                </td>
              </tr>
              <tr>
                <th width="100">menu2</th>
                <td>
                  <ul class="classno1" style="margin-bottom:10;">
                        <li>Some data2</li>
                        <li>Foo2<a href="/link/to/bar2">Bar2</a></li>
                        <li>Foo3<a href="/link/to/bar3">Bar3</a></li>
                        <li>Some data3</li>
                  </ul>
                </td>
              </tr>
    </table>"""
    
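    # Python 2 with BeautifulSoup 3 (the old `BeautifulSoup` module, not `bs4`)
    import BeautifulSoup as bs
    
    soup = bs.BeautifulSoup(html)
    
    # The table of interest is always the first one on the page
    table = soup.findAll('table')[0]
    
    results = {}
    
    # Every <th> holds a menu name; the <td> after it holds the <li> items
    th = table.findChildren('th')
    
    for x in th:
        results_li = []
        # The first nextSibling is the whitespace between </th> and <td>,
        # the second one is the <td> itself
        li = x.nextSibling.nextSibling.findChildren('li')
        for y in li:
            # y.next is the first text node inside the <li>,
            # so the text of a nested <a> is not included
            results_li.append(y.next)
        # x.next is the text of the <th>, e.g. u'menu1'
        results[x.next] = results_li
    
    print results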
    

    Output:

    {
        u'menu2': [u'Some data2', u'Foo2', u'Foo3', u'Some data3'], 
        u'menu1': [u'Some data1', u'Foo1']
    }
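
The output above keeps only the first text node of each <li>, so the link text ('Bar1', 'Bar2', ...) asked for in the question is missing. Below is a minimal sketch of one way to include it. It assumes Python 3 and the `bs4` package (BeautifulSoup 4) rather than the BeautifulSoup 3 module used above. The helper name `table_to_dict` and the shortened `SAMPLE` markup are made up for illustration. It also treats an empty table as an empty dictionary, which is one reasonable reading of the "no menus" case.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's markup, used only for illustration.
SAMPLE = """
<table>
  <tr>
    <th width="100">menu1</th>
    <td>
      <ul class="classno1" style="margin-bottom:10;">
        <li>Some data1</li>
        <li>Foo1<a href="/link/to/bar1">Bar1</a></li>
      </ul>
    </td>
  </tr>
</table>
"""


def table_to_dict(html):
    """Map each <th> text of the first table to the texts of its <li> items."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    result = {}
    if table is None:
        return result
    for row in table.find_all("tr"):
        header = row.find("th")
        cell = row.find("td")
        if header is None or cell is None:
            # Skip rows that do not follow the <th>/<td> layout.
            continue
        # get_text(" ", strip=True) joins the text pieces of an <li>
        # (e.g. "Foo1" and the link text "Bar1") with a single space.
        items = [li.get_text(" ", strip=True) for li in cell.find_all("li")]
        result[header.get_text(strip=True)] = items
    return result


print(table_to_dict(SAMPLE))
# {'menu1': ['Some data1', 'Foo1 Bar1']}
```

Because the loop simply walks whatever <tr> rows exist, the same function handles one menu, several menus, or none, and any number of <li> tags per menu.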
    
