Python BeautifulSoup: loop through table rows by section
I'm new to BeautifulSoup and Python, and I'm pretty sure this is a dead-simple problem, but I can't seem to get anywhere solving it.
I'm trying to loop through the rows of an HTML table, based on "header" rows that group the table by types of candy. My table looks like this:
I want the loop to get the data under each candy heading. So the iterations would get data like this:
first loop iteration: candy_type: KitKat, location: Mall 1, Planned: 63, Actual: 0, Diff: 25
second iteration: candy_type: KitKat, location: Mall 2, Planned: 7, Actual: 0, Diff: 6
... last iteration: candy_type: Skittles, location: Building 2, Planned: 320, Actual: 236, Diff: 0
This is the table code:
<TABLE BORDER="1" WIDTH="100%">
<TR>
<TH COLSPAN=4>Candy</TH>
</TR>
<TR BGCOLOR=#CEE3F6>
<TD COLSPAN=4>
<FONT FACE=Arial>
<center><b>KitKat</b></center>
</FONT>
</TD>
</TR>
<TR BGCOLOR=#336699>
<TD><FONT COLOR=White FACE=Arial SIZE=-2>LOCATION</FONT></TD>
<TD><FONT COLOR=White FACE=Arial SIZE=-2>PLANNED</FONT></TD>
<TD><FONT COLOR=White FACE=Arial SIZE=-2>ACTUAL</FONT></TD>
<TD><FONT COLOR=White FACE=Arial SIZE=-2>DIFF</FONT></TD>
</TR>
<TR>
<TD>Mall 1</TD>
<TD>63</TD>
<TD>0</TD>
<TD>25</TD>
</TR>
<TR>
<TD>Mall 2</TD>
<TD>7</TD>
<TD>0</TD>
<TD>6</TD>
</TR>
<TR BGCOLOR=#CEE3F6>
<TD COLSPAN=4>
<FONT FACE=Arial>
<center><b>OH Henry</b></center>
</FONT>
</TD>
</TR>
<TR BGCOLOR=#336699>
<TD><FONT COLOR=White FACE=Arial SIZE=-2>LOCATION</FONT></TD>
<TD><FONT COLOR=White FACE=Arial SIZE=-2>PLANNED</FONT></TD>
<TD><FONT COLOR=White FACE=Arial SIZE=-2>ACTUAL</FONT></TD>
<TD><FONT COLOR=White FACE=Arial SIZE=-2>DIFF</FONT></TD>
</TR>
<TR>
<TD>Warehouse 1</TD>
<TD>195</TD>
<TD>122</TD>
<TD>30</TD>
</TR>
<TR>
<TD>Warehouse 2</TD>
<TD>96</TD>
<TD>76</TD>
<TD>6</TD>
</TR>
<TR BGCOLOR=#CEE3F6>
<TD COLSPAN=4>
<FONT FACE=Arial>
<center><b>Skittles</b></center>
</FONT>
</TD>
</TR>
<TR BGCOLOR=#336699>
<TD><FONT COLOR=White FACE=Arial SIZE=-2>LOCATION</FONT></TD>
<TD><FONT COLOR=White FACE=Arial SIZE=-2>PLANNED</FONT></TD>
<TD><FONT COLOR=White FACE=Arial SIZE=-2>ACTUAL</FONT></TD>
<TD><FONT COLOR=White FACE=Arial SIZE=-2>DIFF</FONT></TD>
</TR>
<TR>
<TD>Building 1</TD>
<TD>120</TD>
<TD>90</TD>
<TD>5</TD>
</TR>
<TR>
<TD>Building 2</TD>
<TD>320</TD>
<TD>236</TD>
<TD>0</TD>
</TR>
</TABLE>
So I tried:
from bs4 import BeautifulSoup

# Read the local HTML file that contains the table
with open('test.html') as f:
    soup = BeautifulSoup(f.read(), 'lxml')

# The candy "header" rows are the ones with this background colour
candytype = soup.find_all('tr', {'bgcolor': '#CEE3F6'})
for row in candytype:
    print(row)
This prints out the three candy types like this:
<tr bgcolor="#CEE3F6">
<td colspan="4">
<font face="Arial">
</font><center><b>KitKat</b></center>
</td>
</tr>
<tr bgcolor="#CEE3F6">
<td colspan="4">
<font face="Arial">
</font><center><b>OH Henry</b></center>
</td>
</tr>
<tr bgcolor="#CEE3F6">
<td colspan="4">
<font face="Arial">
</font><center><b>Skittles</b></center>
</td>
</tr>
I thought I could group the candy "headers" (i.e. the tr elements whose bgcolor is set to #CEE3F6) and then iterate on that basis, but I cannot figure out how to get further into the data.
Any ideas?
Find all the rows, then iterate through them. When you find one that contains the name of a candy (identified by the colour of the row), keep that name. Now walk through the next siblings of that row. Skip the first, which is the column-heading row, and capture the texts from the td elements of the rows that follow. You know you've passed the last sibling for that candy when you encounter the row naming a different candy (again identified by the colour of the row).
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open('justTable.htm').read(), 'lxml')
>>> trs = soup.find_all('tr')
>>> for tr in trs:
...     # A candy-name row is recognised by its background colour
...     if 'bgcolor' in tr.attrs and tr.attrs['bgcolor']=='#CEE3F6':
...         candy = tr.text.strip()
...         first = True
...         for sibs in tr.find_next_siblings():
...             # Skip the LOCATION/PLANNED/ACTUAL/DIFF heading row
...             if first:
...                 first = False
...                 continue
...             # The next candy-name row ends this section
...             if 'bgcolor' in sibs.attrs and sibs.attrs['bgcolor']=='#CEE3F6':
...                 break
...             [candy]+sibs.text.strip().split('\n')
...
['KitKat', 'Mall 1', '63', '0', '25']
['KitKat', 'Mall 2', '7', '0', '6']
['OH Henry', 'Warehouse 1', '195', '122', '30']
['OH Henry', 'Warehouse 2', '96', '76', '6']
['Skittles', 'Building 1', '120', '90', '5']
['Skittles', 'Building 2', '320', '236', '0']
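If you would rather have the labelled fields the question asks for (candy_type, location, planned, actual, diff), the same idea can also be written as a single pass that remembers the most recent candy row. This is only a minimal sketch under a few assumptions: the names iter_candy_rows, HTML, CANDY_COLOUR and HEADING_COLOUR are made up for illustration, the table is a shortened copy of the one in the question, and it uses the built-in 'html.parser' instead of 'lxml' so it runs without extra installs.

```python
from bs4 import BeautifulSoup

# A shortened copy of the question's table, inlined so the sketch runs on its own.
HTML = """
<TABLE BORDER="1" WIDTH="100%">
<TR><TH COLSPAN=4>Candy</TH></TR>
<TR BGCOLOR=#CEE3F6><TD COLSPAN=4><FONT FACE=Arial><center><b>KitKat</b></center></FONT></TD></TR>
<TR BGCOLOR=#336699><TD>LOCATION</TD><TD>PLANNED</TD><TD>ACTUAL</TD><TD>DIFF</TD></TR>
<TR><TD>Mall 1</TD><TD>63</TD><TD>0</TD><TD>25</TD></TR>
<TR><TD>Mall 2</TD><TD>7</TD><TD>0</TD><TD>6</TD></TR>
<TR BGCOLOR=#CEE3F6><TD COLSPAN=4><FONT FACE=Arial><center><b>Skittles</b></center></FONT></TD></TR>
<TR BGCOLOR=#336699><TD>LOCATION</TD><TD>PLANNED</TD><TD>ACTUAL</TD><TD>DIFF</TD></TR>
<TR><TD>Building 2</TD><TD>320</TD><TD>236</TD><TD>0</TD></TR>
</TABLE>
"""

CANDY_COLOUR = "#CEE3F6"    # background colour of the candy-name rows (from the question)
HEADING_COLOUR = "#336699"  # background colour of the LOCATION/PLANNED/... rows


def iter_candy_rows(html):
    """Yield one dict per data row, tagged with the candy heading above it."""
    soup = BeautifulSoup(html, "html.parser")
    candy = None
    for tr in soup.find_all("tr"):
        colour = tr.get("bgcolor")
        if colour == CANDY_COLOUR:
            candy = tr.get_text(strip=True)  # remember the current section
            continue
        if colour == HEADING_COLOUR or candy is None:
            continue  # column headings, or rows before the first candy
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) != 4:
            continue  # not a location/planned/actual/diff row
        location, planned, actual, diff = cells
        yield {
            "candy_type": candy,
            "location": location,
            "planned": int(planned),
            "actual": int(actual),
            "diff": int(diff),
        }


rows = list(iter_candy_rows(HTML))
for row in rows:
    print(row)
```

Each iteration then gives you one dict, so row["candy_type"] and row["planned"] are available directly. The int() conversions assume every data cell holds a whole number, as in the sample table; drop them if your real data can contain blanks or other text.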