Iterating through a table of rows with Beautiful Soup in Python

I'm trying to parse a table of rows using Beautiful Soup and save the values of each row in a dict.

One hiccup is that some rows in the table act as section headers. So for any row with the class 'header' I want to define a variable called "section". My attempt is not working: it fails at ['class'] with TypeError: string indices must be integers

Here's what I have:

for i in credits.contents:
    if i['class'] == 'header':
        section = i.contents
        DATA_SET[section] = {}
    else:
        DATA_SET[section]['data_point_1'] = i.find('td', {'class' : 'data_point_1'}).find('p').contents
        DATA_SET[section]['data_point_2'] = i.find('td', {'class' : 'data_point_2'}).find('p').contents
        DATA_SET[section]['data_point_3'] = i.find('td', {'class' : 'data_point_3'}).find('p').contents

Example of data:

<table class="credits">
    <tr class="header">
        <th colspan="3"><h1>HEADER NAME</h1></th>
    </tr>
    <tr>
        <td class="data_point_1"><p>DATA</p></td>
        <td class="data_point_2"><p>DATA</p></td>
        <td class="data_point_3"><p>DATA</p></td>
    </tr>
    <tr>
        <td class="data_point_1"><p>DATA</p></td>
        <td class="data_point_2"><p>DATA</p></td>
        <td class="data_point_3"><p>DATA</p></td>
    </tr>
    <tr>
        <td class="data_point_1"><p>DATA</p></td>
        <td class="data_point_2"><p>DATA</p></td>
        <td class="data_point_3"><p>DATA</p></td>
    </tr>
    <tr class="header">
        <th colspan="3"><h1>HEADER NAME</h1></th>
    </tr>
    <tr>
        <td class="data_point_1"><p>DATA</p></td>
        <td class="data_point_2"><p>DATA</p></td>
        <td class="data_point_3"><p>DATA</p></td>
    </tr>
    <tr>
        <td class="data_point_1"><p>DATA</p></td>
        <td class="data_point_2"><p>DATA</p></td>
        <td class="data_point_3"><p>DATA</p></td>
    </tr>
    <tr>
        <td class="data_point_1"><p>DATA</p></td>
        <td class="data_point_2"><p>DATA</p></td>
        <td class="data_point_3"><p>DATA</p></td>
    </tr>
</table>

Here is one solution, with a slight adaptation of your example data so that the result is clearer:

from BeautifulSoup import BeautifulSoup
from pprint import pprint

html = '''<body><table class="credits">
    <tr class="header">
        <th colspan="3"><h1>HEADER 1</h1></th>
    </tr>
    <tr>
        <td class="data_point_1"><p>DATA11</p></td>
        <td class="data_point_2"><p>DATA12</p></td>
        <td class="data_point_3"><p>DATA12</p></td>
    </tr>
    <tr>
        <td class="data_point_1"><p>DATA21</p></td>
        <td class="data_point_2"><p>DATA22</p></td>
        <td class="data_point_3"><p>DATA23</p></td>
    </tr>
    <tr>
        <td class="data_point_1"><p>DATA31</p></td>
        <td class="data_point_2"><p>DATA32</p></td>
        <td class="data_point_3"><p>DATA33</p></td>
    </tr>
    <tr class="header">
        <th colspan="3"><h1>HEADER 2</h1></th>
    </tr>
    <tr>
        <td class="data_point_1"><p>DATA11</p></td>
        <td class="data_point_2"><p>DATA12</p></td>
        <td class="data_point_3"><p>DATA13</p></td>
    </tr>
    <tr>
        <td class="data_point_1"><p>DATA21</p></td>
        <td class="data_point_2"><p>DATA22</p></td>
        <td class="data_point_3"><p>DATA23</p></td>
    </tr>
    <tr>
        <td class="data_point_1"><p>DATA31</p></td>
        <td class="data_point_2"><p>DATA32</p></td>
        <td class="data_point_3"><p>DATA33</p></td>
    </tr>
</table></body>'''

soup = BeautifulSoup(html)
rows = soup.findAll('tr')

section = ''
dataset = {}
for row in rows:
    if row.attrs:
        section = row.text
        dataset[section] = {}
    else:
        cells = row.findAll('td')
        for cell in cells:
            if cell['class'] in dataset[section]:
                dataset[section][ cell['class'] ].append( cell.text )
            else:
                dataset[section][ cell['class'] ] = [ cell.text ]

pprint(dataset)

Produces:

{u'HEADER 1': {u'data_point_1': [u'DATA11', u'DATA21', u'DATA31'],
               u'data_point_2': [u'DATA12', u'DATA22', u'DATA32'],
               u'data_point_3': [u'DATA12', u'DATA23', u'DATA33']},
 u'HEADER 2': {u'data_point_1': [u'DATA11', u'DATA21', u'DATA31'],
               u'data_point_2': [u'DATA12', u'DATA22', u'DATA32'],
               u'data_point_3': [u'DATA13', u'DATA23', u'DATA33']}}
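
The snippet above is written for the old BeautifulSoup 3 package and Python 2 (note the `from BeautifulSoup import` line and the `u'...'` output). As a rough sketch only, the same grouping might look like this with the `bs4` package on Python 3. The shortened table, the helper name `group_rows`, and the choice of the built-in `html.parser` are assumptions made for illustration. One difference worth knowing is that `bs4` returns the `class` attribute as a list, so it cannot be used directly as a dict key.

```python
from collections import defaultdict

from bs4 import BeautifulSoup

# Shortened, hypothetical version of the example table.
html = """<table class="credits">
  <tr class="header"><th colspan="3"><h1>HEADER 1</h1></th></tr>
  <tr><td class="data_point_1"><p>DATA11</p></td><td class="data_point_2"><p>DATA12</p></td></tr>
  <tr><td class="data_point_1"><p>DATA21</p></td><td class="data_point_2"><p>DATA22</p></td></tr>
  <tr class="header"><th colspan="3"><h1>HEADER 2</h1></th></tr>
  <tr><td class="data_point_1"><p>DATA31</p></td><td class="data_point_2"><p>DATA32</p></td></tr>
</table>"""


def group_rows(markup):
    """Group cell texts by section header, then by the cell's class name."""
    soup = BeautifulSoup(markup, "html.parser")
    dataset = {}
    section = None
    for row in soup.find_all("tr"):
        # bs4 returns multi-valued attributes such as class as a list.
        if "header" in row.get("class", []):
            section = row.get_text(strip=True)
            dataset[section] = defaultdict(list)
        elif section is not None:
            for cell in row.find_all("td"):
                # Join the class list into a string so it can serve as a key.
                key = " ".join(cell.get("class", []))
                dataset[section][key].append(cell.get_text(strip=True))
    # Convert the defaultdicts back to plain dicts for a cleaner result.
    return {name: dict(columns) for name, columns in dataset.items()}


result = group_rows(html)
print(result)
```

Checking for the `header` class explicitly, rather than testing `row.attrs`, keeps the loop from treating any other attributed row as a section header.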

EDIT: ADAPTATION OF YOUR SOLUTION

Your code is neat and has only a couple of issues. You use contents in places where you should use text or findAll; I repaired that below:

soup = BeautifulSoup(html)
credits = soup.find('table')

section = ''
DATA_SET = {}

for i in credits.findAll('tr'):
    if i.get('class', '') == 'header':
        section = i.text
        DATA_SET[section] = {}
    else:
        DATA_SET[section]['data_point_1'] = i.find('td', {'class' : 'data_point_1'}).find('p').contents
        DATA_SET[section]['data_point_2'] = i.find('td', {'class' : 'data_point_2'}).find('p').contents
        DATA_SET[section]['data_point_3'] = i.find('td', {'class' : 'data_point_3'}).find('p').contents

print DATA_SET
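
As for the original TypeError: a likely cause is that `.contents` lists every child node, including the whitespace strings between the `<tr>` tags, and a string cannot be indexed with `'class'`. The sketch below illustrates this with the `bs4` package on Python 3, which is an assumption since the code above uses the older BeautifulSoup 3 import. The tiny two-row table is made up for the demonstration.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical two-row table with a newline between the rows.
html = '<table><tr class="header"><th>H</th></tr>\n<tr><td>x</td></tr></table>'
table = BeautifulSoup(html, "html.parser").find("table")

# .contents includes the newline text node between the two rows.
kinds = [type(node).__name__ for node in table.contents]
print(kinds)

# Indexing the text node by attribute name falls through to str indexing.
try:
    table.contents[1]["class"]
except TypeError as exc:
    error = str(exc)
    print(error)

# Keeping only Tag instances skips the text nodes; .get avoids a KeyError
# on rows that have no class attribute.
rows = [node for node in table.contents if isinstance(node, Tag)]
classes = [row.get("class") for row in rows]
print(classes)
```

Iterating over `findAll('tr')` (or `find_all('tr')` in `bs4`), as the repaired code does, sidesteps the problem because it only ever yields tags.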

Please note that because the cells in successive rows share the same data_point classes, each row overwrites the values stored for the previous one. I suspect this is not an issue in your real dataset, but that is why your code would return this abbreviated result:

{u'HEADER 2': {'data_point_2': [u'DATA32'],
               'data_point_3': [u'DATA33'],
               'data_point_1': [u'DATA31']},
 u'HEADER 1': {'data_point_2': [u'DATA32'],
               'data_point_3': [u'DATA33'],
               'data_point_1': [u'DATA31']}}
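
If every row should be kept instead, one option is to append to a list rather than assign. The following is a minimal sketch in plain Python 3 with no parsing involved. The `rows` list is a made-up stand-in for already-extracted cell values, and the names `overwritten` and `accumulated` are illustrative only.

```python
# Hypothetical pre-extracted rows: (section name, {cell class: cell text}).
rows = [
    ("HEADER 1", {"data_point_1": "DATA11", "data_point_2": "DATA12"}),
    ("HEADER 1", {"data_point_1": "DATA21", "data_point_2": "DATA22"}),
]

overwritten = {}
accumulated = {}
for section, cells in rows:
    for name, value in cells.items():
        # Plain assignment keeps only the value from the latest row.
        overwritten.setdefault(section, {})[name] = [value]
        # setdefault creates the list once, then every row appends to it.
        accumulated.setdefault(section, {}).setdefault(name, []).append(value)

print(overwritten)
print(accumulated)
```

The first dict ends up holding only the last row per section, which mirrors the abbreviated result above, while the second keeps one entry per row.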
