How to separate rows of parsed html table with Python
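(Update) I am trying to parse some HTML tables, but I am having trouble separating the rows and columns. I am trying to extract the tables from some HTML files, for example: ( http://www.sec.gov/Archives/edgar/data/5094/000095012313004020/h30303def14a.htm )
So I fetch the HTML, and Beautiful Soup gives me the tables: soup=BeautifulSoup(table)
Then I have a function that separates the rows and columns: data=collapsetable(soup)
I use the background color to separate rows, but I am not sure how to split rows in tables that have no background color to act as a row separator.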
def collapsetable(soup,combine_rows=True):
    rows=[]
    lastcolor=None
    for tr in soup('tr'):
        try:
            color=tr['bgcolor']
        except:
            color=''
        row=[]
        for td in tr('th')+tr('td'):
            try:
                span=int(td['colspan'])
            except:
                span=1
            try:
                color=td['bgcolor']
            except:
                pass
            datum=''.join([getdeepcontent(t) for t in td.contents])
            row+=[datum]+['']*(span-1)
        # Use Colors to find the row split
        if color==lastcolor and combine_rows:
            for i in range(len(row)):
                if i>=len(rows[-1]):
                    rows[-1].append(row[i])
                else:
                    rows[-1][i]+=' '+row[i]
        else:
            rows.append(row)
        lastcolor=color
    clean_rows(rows)
    return rows
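For example, the HTML table I want from this file is the one with the heading "Independent Trustees:". With my function I get all the columns, but I do not know where to split the rows.
For example, here is the HTML for one of the tables: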
<table border="0" width="100%" align="center" cellpadding="0" cellspacing="0" style="font-size: 8pt; font-family: 'Times New Roman', Times; color: #000000; background: transparent"><!-- Table Width Row BEGIN --><tr style="font-size: 1pt" valign="bottom"> <td width="25%"> </td> <!-- colindex=01 type=maindata --> <td width="1%"> </td> <!-- colindex=02 type=gutter --> <td width="6%"> </td> <!-- colindex=02 type=maindata --> <td width="2%"> </td> <!-- colindex=03 type=gutter --> <td width="9%"> </td> <!-- colindex=03 type=maindata --> <td width="1%"> </td> <!-- colindex=04 type=gutter --> <td width="23%"> </td> <!-- colindex=04 type=maindata --> <td width="2%"> </td> <!-- colindex=05 type=gutter --> <td width="6%"> </td> <!-- colindex=05 type=maindata --> <td width="2%"> </td> <!-- colindex=06 type=gutter --> <td width="23%"> </td> <!-- colindex=06 type=maindata --></tr><!-- Table Width Row END --><!-- TableOutputHead --><tr style="font-size: 8pt" valign="bottom" align="center"><td nowrap="nowrap" align="left" valign="bottom"> </td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> </td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> </td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> </td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> <b>Number of<br /> </b></td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> </td></tr><tr style="font-size: 8pt" valign="bottom" align="center"><td nowrap="nowrap" align="left" valign="bottom"> </td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> </td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> </td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> </td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> <b>Funds in<br /> </b></td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> </td></tr><tr style="font-size: 8pt" valign="bottom" align="center"><td nowrap="nowrap" align="left" valign="bottom"> </td><td> 
</td><td nowrap="nowrap" align="center" valign="bottom"> </td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> </td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> </td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> <b>Fund<br /> </b></td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> </td></tr><tr style="font-size: 8pt" valign="bottom" align="center"><td nowrap="nowrap" align="left" valign="bottom"> </td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> <b>Position(s)<br /> </b></td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> <b>Term of Office<br /> </b></td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> </td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> <b>Complex<br /> </b></td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> <b>Other Directorships<br /> </b></td></tr><tr style="font-size: 8pt" valign="bottom" align="center"><td nowrap="nowrap" align="left" valign="bottom"> <b>Name and Year of<br /> </b></td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> <b>Held with<br /> </b></td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> <b>and Length of<br /> </b></td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> <b>Principal Occupation(s)<br /> </b></td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> <b>Overseen<br /> </b></td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> <b>Held by Trustee<br /> </b></td></tr><tr style="font-size: 8pt" valign="bottom" align="center"><td nowrap="nowrap" align="left" valign="bottom"><div style="border-bottom: 1px solid #000000; width: 1%; padding-bottom: 1px"> <b>Birth of Trustee</b></div></td><td> </td><td nowrap="nowrap" align="center" valign="bottom"><div style="border-bottom: 1px solid #000000; width: 1%; padding-bottom: 1px"> <b>Funds</b></div></td><td> </td><td nowrap="nowrap" align="center" valign="bottom"><div 
style="border-bottom: 1px solid #000000; width: 1%; padding-bottom: 1px"> <b>Time Served</b></div></td><td> </td><td nowrap="nowrap" align="center" valign="bottom"><div style="border-bottom: 1px solid #000000; width: 1%; padding-bottom: 1px"> <b>During the Past Five Years</b></div></td><td> </td><td nowrap="nowrap" align="center" valign="bottom"><div style="border-bottom: 1px solid #000000; width: 1%; padding-bottom: 1px"> <b>by Trustee</b></div></td><td> </td><td nowrap="nowrap" align="center" valign="bottom"><div style="border-bottom: 1px solid #000000; width: 1%; padding-bottom: 1px"> <b>During the Past Five Years</b></div></td></tr><tr style="line-height: 3pt; font-size: 1pt"><td> </td></tr><!-- TableOutputBody --><tr valign="bottom"><td align="left" valign="top"> David C. Arch (1945)</td><td> </td><td nowrap="nowrap" align="left" valign="top"> Trustee</td><td> </td><td nowrap="nowrap" align="center" valign="top"> †</td><td> </td><td align="left" valign="top"> Chairman and Chief Executive Officer of Blistex Inc., a consumer health care products manufacturer. <br /> Formerly: Member of the Heartland Alliance Advisory Board, a nonprofit organization serving human needs based in Chicago.</td><td> </td><td nowrap="nowrap" align="center" valign="top"> 136</td><td> </td><td align="left" valign="top"> Trustee/Managing General Partner of funds in the Fund Complex. Board member of the Illinois Manufacturers’ Association. Member of the Board of Visitors, Institute for the Humanities, University of Michigan.</td></tr><tr valign="bottom" style="line-height: 6pt"><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td></tr><tr valign="bottom"><td align="left" valign="top"> Jerry D. Choate (1938)</td><td> </td><td nowrap="nowrap" align="left" valign="top"> Trustee</td><td> </td><td nowrap="nowrap" align="center" valign="top"> †</td><td> </td><td align="left" valign="top"> Retired. 
From 1995 to 1999, Chairman and Chief Executive Officer of the Allstate Corporation (“Allstate”) and Allstate Insurance Company. From 1994 to 1995, President and Chief Executive Officer of Allstate. Prior to 1994, various management positions at Allstate.</td><td> </td><td nowrap="nowrap" align="center" valign="top"> 13</td><td> </td><td align="left" valign="top"> Trustee/Managing General Partner of funds in the Fund Complex. Director since 1998 and member of the governance and nominating committee, executive committee, compensation and management development committee and equity award committee, of Amgen Inc., a biotechnological company. Director since 1999 and member of the nominating and governance committee and compensation and executive committee, of Valero Energy Corporation, a crude oil refining and marketing company. Previously, from 2006 to 2007, Director and member of the compensation committee and audit committee, of H&R Block, a tax preparation services company.</td></tr><tr valign="bottom" style="line-height: 6pt"><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td></tr><tr valign="bottom"><td align="left" valign="top"> Linda Hutton Heagy<sup style="font-size: 85%; vertical-align: top">1</sup> (1948)</td><td> </td><td nowrap="nowrap" align="left" valign="top"> Trustee</td><td> </td><td nowrap="nowrap" align="center" valign="top"> †</td><td> </td><td align="left" valign="top"> Retired. Prior to June 2008, Managing Partner of Heidrick & Struggles, the second largest global executive search firm, and from 2001-2004, Regional Managing Director of U.S. operations at Heidrick & Struggles. Prior to 1997, Managing Partner of Ray & Berndtson, Inc., an executive recruiting firm. Prior to 1995, Executive Vice President of ABN AMRO, N.A., a bank holding company, with oversight for treasury management operations including all non-credit product pricing. 
Prior to 1990, experience includes Executive Vice President of The Exchange National Bank with oversight of treasury management including capital markets operations, Vice President of Northern Trust Company and a trainee at Price Waterhouse.</td><td> </td><td nowrap="nowrap" align="center" valign="top"> 13</td><td> </td><td align="left" valign="top"> Trustee/Managing General Partner of funds in the Fund Complex. Prior to 2010, Trustee on the University of Chicago Medical Center Board, Vice Chair of the Board of the YMCA of Metropolitan Chicago and a member of the Women’s Board of the University of Chicago.</td></tr></table>
Any help is greatly appreciated.
If you uncomment
print('{}: {}'.format(len(row), row))
in the code below, you will see something like
11: ['', '', '', '', '', '', '', '', '', '', '']
11: ['', '', '', '', '', '', '', '', u'Number of', '', '']
11: ['', '', '', '', '', '', '', '', u'Funds in', '', '']
11: ['', '', '', '', '', '', '', '', u'Fund', '', '']
11: ['', '', u'Position(s)', '', u'Term of Office', '', '', '', u'Complex', '', u'Other Directorships']
11: [u'Name and Year of', '', u'Held with', '', u'and Length of', '', u'Principal Occupation(s)', '', u'Overseen', '', u'Held by Trustee']
11: [u'Birth of Trustee', '', u'Funds', '', u'Time Served', '', u'During the Past Five Years', '', u'by Trustee', '', u'During the Past Five Years']
1: ['']
11: [u'David C. Arch (1945)', '', u'Trustee', '', u'\x86', '', u'Chairman and Chief Executive Officer of Blistex Inc., a consumer\n health care products manufacturer.Formerly: Member of the Heartland Alliance Advisory Board, a\n nonprofit organization serving human needs based in Chicago.', '', u'136', '', u'Trustee/Managing General Partner of funds in the Fund Complex.\n Board member of the Illinois Manufacturers\x92 Association.\n Member of the Board of Visitors, Institute for the Humanities,\n University of Michigan.']
11: ['', '', '', '', '', '', '', '', '', '', '']
This shows that the header is separated from the row data by a row of length 1:
1: ['']
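So instead of using bgcolor to identify which rows to merge, you could use the length of the row as a signal that all the preceding rows need to be combined.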
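As a rough illustration of that idea on rows that have already been parsed, here is a minimal sketch. It is not the code from this answer: `split_at_separator` is a hypothetical helper, and the sample rows are abbreviated from the output above (the empty gutter columns are left out).

```python
def split_at_separator(parsed_rows):
    """Split rows into (header_rows, body_rows) at the first row of length 1.

    Hypothetical helper for illustration; all-empty rows are dropped.
    """
    for index, row in enumerate(parsed_rows):
        if len(row) == 1:
            # Everything before the separator is a header fragment.
            header = [r for r in parsed_rows[:index] if any(r)]
            body = [r for r in parsed_rows[index + 1:] if any(r)]
            return header, body
    # No separator found: treat every non-empty row as a body row.
    return [], [r for r in parsed_rows if any(r)]


# Abbreviated sample data (assumption: gutter columns removed for brevity).
parsed_rows = [
    ['', '', ''],
    ['', 'Position(s)', 'Term of Office'],
    ['Name and Year of', 'Held with', 'and Length of'],
    ['Birth of Trustee', 'Funds', 'Time Served'],
    [''],
    ['David C. Arch (1945)', 'Trustee', '136'],
    ['', '', ''],
    ['Jerry D. Choate (1938)', 'Trustee', '13'],
]
header, body = split_at_separator(parsed_rows)
print(len(header), len(body))  # prints: 3 2
```

The three header fragments still have to be merged into one row, and a table with more than one length-1 row would need the split applied repeatedly.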
import bs4 as bs
import urllib2  # Python 2 only


def collapse(table):
    result = []
    rows = []  # rows collected since the last separator
    for tr in table('tr'):
        row = []
        for td in tr('th') + tr('td'):
            try:
                span = int(td['colspan'])
            except KeyError:
                span = 1
            datum = ''.join(td.stripped_strings)
            # pad with empty strings so a spanning cell fills `span` columns
            row.extend([datum] + [''] * (span - 1))
        if row:
            # print('{}: {}'.format(len(row), row))
            if len(row) > 1:
                # keep only rows that have at least one non-empty cell
                if any(row):
                    rows.append(row)
            else:
                # a row of length 1 is the separator: merge everything collected so far
                result.extend(combine(rows))
                rows = []
    if rows:
        result.extend(rows)
    return result


def combine(rows):
    # join the rows column by column into a single row
    return [[' '.join(col) for col in zip(*rows)]]


# url = 'http://www.sec.gov/Archives/edgar/data/5094/000095012313004020/h30303def14a.htm'
# soup = bs.BeautifulSoup(urllib2.urlopen(url))
# used for developing/debugging
with open('/tmp/def14a.htm', 'r') as f:
    soup = bs.BeautifulSoup(f.read())

for table in soup.find_all('table'):
    print(collapse(table))
    print('-' * 80)
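One thing to be aware of: because `combine` joins every cell with `' '.join(col)`, empty cells still contribute a space, so a merged header can come out with leading or doubled spaces. A possible variation, offered only as a sketch, skips empty cells while joining. `combine_clean` is a hypothetical name, it returns a single row rather than a list containing one row, and the sample rows are abbreviated from the output shown earlier.

```python
def combine_clean(rows):
    """Merge rows column by column, skipping empty cells.

    Hypothetical variation on combine(); returns one merged row.
    """
    return [' '.join(cell for cell in col if cell) for col in zip(*rows)]


# Abbreviated header fragments (assumption: only the first five columns).
header_rows = [
    ['', '', 'Position(s)', '', 'Term of Office'],
    ['Name and Year of', '', 'Held with', '', 'and Length of'],
    ['Birth of Trustee', '', 'Funds', '', 'Time Served'],
]
merged = combine_clean(header_rows)
print(merged)
# ['Name and Year of Birth of Trustee', '', 'Position(s) Held with Funds',
#  '', 'Term of Office and Length of Time Served']
```

To use it inside `collapse`, the call would need to be wrapped as `[combine_clean(rows)]`, since `result.extend` expects a list of rows.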