python beautifulsoup loop through table rows by section

I'm new to BeautifulSoup and Python, and I'm pretty sure this is a dead-simple problem, but I can't seem to get anywhere solving it.

I'm trying to loop through the rows of an HTML table, based on "header" rows that group the table by type of candy. My table looks like this: [image]

I want the loop to get the data under each candy heading. So the iterations would get data like this:

first loop iteration: candy_type: KitKat, location: Mall 1, Planned: 63, Actual: 0, Diff: 25

second iteration: candy_type: KitKat, location: Mall 2, Planned: 7, Actual: 0, Diff: 6

... last iteration: candy_type: Skittles, location: Building 2, Planned: 320, Actual: 236, Diff: 0

This is the table code:

<TABLE BORDER="1" WIDTH="100%">
   <TR>
      <TH COLSPAN=4>Candy</TH>
   </TR>
   <TR BGCOLOR=#CEE3F6>
      <TD COLSPAN=4>
         <FONT  FACE=Arial>
            <center><b>KitKat</b></center>
         </FONT>
      </TD>
   </TR>
   <TR BGCOLOR=#336699>
      <TD><FONT  COLOR=White FACE=Arial SIZE=-2>LOCATION</FONT></TD>
      <TD><FONT  COLOR=White FACE=Arial SIZE=-2>PLANNED</FONT></TD>
      <TD><FONT  COLOR=White FACE=Arial SIZE=-2>ACTUAL</FONT></TD>
      <TD><FONT  COLOR=White FACE=Arial SIZE=-2>DIFF</FONT></TD>
   </TR>
   <TR>
      <TD>Mall 1</TD>
      <TD>63</TD>
      <TD>0</TD>
      <TD>25</TD>
   </TR>
   <TR>
      <TD>Mall 2</TD>
      <TD>7</TD>
      <TD>0</TD>
      <TD>6</TD>
   </TR>
   <TR BGCOLOR=#CEE3F6>
      <TD COLSPAN=4>
         <FONT  FACE=Arial>
            <center><b>OH Henry</b></center>
         </FONT>
      </TD>
   </TR>
   <TR BGCOLOR=#336699>
      <TD><FONT  COLOR=White FACE=Arial SIZE=-2>LOCATION</FONT></TD>
      <TD><FONT  COLOR=White FACE=Arial SIZE=-2>PLANNED</FONT></TD>
      <TD><FONT  COLOR=White FACE=Arial SIZE=-2>ACTUAL</FONT></TD>
      <TD><FONT  COLOR=White FACE=Arial SIZE=-2>DIFF</FONT></TD>
   </TR>
   <TR>
      <TD>Warehouse 1</TD>
      <TD>195</TD>
      <TD>122</TD>
      <TD>30</TD>
   </TR>
   <TR>
      <TD>Warehouse 2</TD>
      <TD>96</TD>
      <TD>76</TD>
      <TD>6</TD>
   </TR>
   <TR BGCOLOR=#CEE3F6>
      <TD COLSPAN=4>
         <FONT  FACE=Arial>
            <center><b>Skittles</b></center>
         </FONT>
      </TD>
   </TR>
   <TR BGCOLOR=#336699>
      <TD><FONT  COLOR=White FACE=Arial SIZE=-2>LOCATION</FONT></TD>
      <TD><FONT  COLOR=White FACE=Arial SIZE=-2>PLANNED</FONT></TD>
      <TD><FONT  COLOR=White FACE=Arial SIZE=-2>ACTUAL</FONT></TD>
      <TD><FONT  COLOR=White FACE=Arial SIZE=-2>DIFF</FONT></TD>
   </TR>
   <TR>
      <TD>Building 1</TD>
      <TD>120</TD>
      <TD>90</TD>
      <TD>5</TD>
   </TR>
   <TR>
      <TD>Building 2</TD>
      <TD>320</TD>
      <TD>236</TD>
      <TD>0</TD>
   </TR>
</TABLE> 

So I tried:

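from bs4 import BeautifulSoup

# Read the local HTML file that contains the table
with open('test.html') as f:
    soup = BeautifulSoup(f.read(), 'lxml')

# The candy "header" rows are the ones with this background colour
candytype = soup.find_all('tr', {'bgcolor': '#CEE3F6'})
for candy_row in candytype:
    print(candy_row)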

This prints out the three candy types like this:

<tr bgcolor="#CEE3F6">
<td colspan="4">
<font face="Arial">
</font><center><b>KitKat</b></center>
</td>
</tr>
<tr bgcolor="#CEE3F6">
<td colspan="4">
<font face="Arial">
</font><center><b>OH Henry</b></center>
</td>
</tr>
<tr bgcolor="#CEE3F6">
<td colspan="4">
<font face="Arial">
</font><center><b>Skittles</b></center>
</td>
</tr>

I thought I could group the candy "headers" (i.e. the tr elements whose bgcolor is set to #CEE3F6) and then iterate on that basis, but I cannot figure out how to get further into the data.

Any ideas?

Find all the rows, then iterate through them. When you find one that contains the name of a candy (identified by the row's colour), keep that name. Then walk through the next siblings of that row. Skip the first, which is the column-heading row, and capture the text of the td elements in each subsequent row. You know you have passed the last row of the section when you encounter the header row of a different candy (again identified by the row's colour).

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open('justTable.htm').read(), 'lxml')
>>> trs = soup.find_all('tr')
>>> for tr in trs:
...     # A candy "header" row is identified by its background colour
...     if tr.get('bgcolor') == '#CEE3F6':
...         candy = tr.text.strip()
...         first = True
...         for sibs in tr.find_next_siblings():
...             if first:
...                 # Skip the LOCATION/PLANNED/ACTUAL/DIFF heading row
...                 first = False
...                 continue
...             if sibs.get('bgcolor') == '#CEE3F6':
...                 break  # reached the next candy's header row
...             [candy] + sibs.text.strip().split('\n')
... 
['KitKat', 'Mall 1', '63', '0', '25']
['KitKat', 'Mall 2', '7', '0', '6']
['OH Henry', 'Warehouse 1', '195', '122', '30']
['OH Henry', 'Warehouse 2', '96', '76', '6']
['Skittles', 'Building 1', '120', '90', '5']
['Skittles', 'Building 2', '320', '236', '0']
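
If you would rather have one record per row in the shape described in the question, the same idea can be sketched as a single pass over the rows that remembers the most recent candy heading. This is only a sketch under some assumptions: the function name `rows_by_candy`, the dictionary keys and the shortened `sample` table are made up for illustration, every data row is assumed to have exactly four cells holding whole numbers, and it uses the built-in `html.parser` instead of `lxml`.

```python
from bs4 import BeautifulSoup

HEADER_COLOR = "#CEE3F6"  # background colour of the candy-name rows


def rows_by_candy(html):
    """Yield one dict per data row, tagged with the candy heading above it."""
    soup = BeautifulSoup(html, "html.parser")
    candy = None
    for tr in soup.find_all("tr"):
        if tr.get("bgcolor") == HEADER_COLOR:
            # A new section starts: remember its candy name
            candy = tr.get_text(strip=True)
            continue
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        # Skip rows before the first heading, coloured column-heading
        # rows, and anything that is not a four-cell data row
        if candy is None or tr.get("bgcolor") is not None or len(cells) != 4:
            continue
        location, planned, actual, diff = cells
        yield {
            "candy_type": candy,
            "location": location,
            "planned": int(planned),
            "actual": int(actual),
            "diff": int(diff),
        }


# Shortened, hypothetical copy of the table from the question
sample = """
<TABLE BORDER="1">
   <TR><TH COLSPAN=4>Candy</TH></TR>
   <TR BGCOLOR=#CEE3F6><TD COLSPAN=4><FONT FACE=Arial><center><b>KitKat</b></center></FONT></TD></TR>
   <TR BGCOLOR=#336699><TD>LOCATION</TD><TD>PLANNED</TD><TD>ACTUAL</TD><TD>DIFF</TD></TR>
   <TR><TD>Mall 1</TD><TD>63</TD><TD>0</TD><TD>25</TD></TR>
   <TR><TD>Mall 2</TD><TD>7</TD><TD>0</TD><TD>6</TD></TR>
   <TR BGCOLOR=#CEE3F6><TD COLSPAN=4><FONT FACE=Arial><center><b>Skittles</b></center></FONT></TD></TR>
   <TR BGCOLOR=#336699><TD>LOCATION</TD><TD>PLANNED</TD><TD>ACTUAL</TD><TD>DIFF</TD></TR>
   <TR><TD>Building 2</TD><TD>320</TD><TD>236</TD><TD>0</TD></TR>
</TABLE>
"""

for row in rows_by_candy(sample):
    print(row)
```

Reading each td separately avoids relying on newlines between the cells, which the `split('\n')` approach above depends on.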
