python/beautifulsoup: find previous row with particular attribute
I am scraping an HTML file that contains a table like this:
<table>
<tr valign="bottom"><td colspan="3"><p> <br/><b>PASSERIFORMES: Cardinalidae</b></p></td></tr>
<tr class="highlight1"><td>Summer Tanager</td><td><a href="species.jsp?avibaseid=891798D9EFFE1F8D"><i>Piranga rubra</i></a></td><td>Piranga vermillon</td><td>Rare/Accidental </td></tr>
<tr class="highlight1"><td>Scarlet Tanager</td><td><a href="species.jsp?avibaseid=4210163221C2E458"><i>Piranga olivacea</i></a></td><td>Piranga écarlate</td><td>Rare/Accidental </td></tr>
<tr class="highlight1"><td>Rose-breasted Grosbeak</td><td><a href="species.jsp?avibaseid=7C2FCB13BAA660EE"><i>Pheucticus ludovicianus</i></a></td><td>Cardinal à poitrine rose</td><td>Rare/Accidental </td></tr>
<tr valign="bottom"><td colspan="3"><p> <br/><b>PASSERIFORMES: Buntings</b></p></td></tr>
<tr class="highlight1"><td>Indigo Bunting</td><td><a href="species.jsp?avibaseid=043F337AA25E7D97"><i>Passerina cyanea</i></a></td><td>Passerin indigo</td><td>Rare/Accidental </td></tr>
<tr class="highlight1"><td>Dickcissel</td><td><a href="species.jsp?avibaseid=592E58CE67D092DA"><i>Spiza americana</i></a></td><td>Dickcissel d'Amérique</td><td>Rare/Accidental </td></tr>
</table>
I can get the values from the tr class="highlight1" rows and write them to a CSV without any trouble by doing the following:
import csv
from bs4 import BeautifulSoup

soup = BeautifulSoup(open(r"/Users/user/Downloads/birds.html"), 'lxml')
# One list per column, taken from the rows that carry a class attribute
english = [item.text for item in soup.select('tr[class] td:nth-of-type(1)')]
latin = [item.text for item in soup.select('tr[class] td:nth-of-type(2)')]
french = [item.text for item in soup.select('tr[class] td:nth-of-type(3)')]
status = [item.text for item in soup.select('tr[class] td:nth-of-type(4)')]
link = [item['href'] for item in soup.select('tr[class] a[href]')]
test = zip(english, latin, french, status, link)
# newline='' keeps the csv module from writing blank lines on Windows
with open('birdfile.csv', 'wt', newline='') as csvfile:
    csv_out = csv.writer(csvfile)
    csv_out.writerows(test)
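What I want to do is get, for each row, the value from the tr valign="bottom" row above it. Basically, I know how to move forward and drill down using CSS selectors in BeautifulSoup, but I can't figure out how to go backwards and select the tr valign="bottom" that precedes each tr class="highlight1".
I would like my CSV output to look like this: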
PASSERIFORMES: Cardinalidae,Summer Tanager,Piranga rubra...
PASSERIFORMES: Cardinalidae,Scarlet Tanager,Piranga olivacea...
PASSERIFORMES: Cardinalidae,Rose-breasted Grosbeak,Pheucticus ludovicianus...
PASSERIFORMES: Buntings,Indigo Bunting,Passerina cyanea...
PASSERIFORMES: Buntings,Dickcissel,Spiza americana...
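I couldn't find any examples of this, and I would really appreciate any help!
You can simply read your table into pandas and then slice and dice it however you see fit: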
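from io import StringIO
import pandas as pd
langs = """your html above"""
# Recent pandas versions expect a file-like object rather than a literal HTML string
df = pd.read_html(StringIO(langs))
df[0]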
Output (pardon the formatting):
0 1 2 3
0 PASSERIFORMES: Cardinalidae PASSERIFORMES: Cardinalidae PASSERIFORMES: Cardinalidae NaN
1 Summer Tanager Piranga rubra Piranga vermillon Rare/Accidental
2 Scarlet Tanager Piranga olivacea Piranga écarlate Rare/Accidental
3 Rose-breasted Grosbeak Pheucticus ludovicianus Cardinal à poitrine rose Rare/Accidental
4 PASSERIFORMES: Buntings PASSERIFORMES: Buntings PASSERIFORMES: Buntings NaN
5 Indigo Bunting Passerina cyanea Passerin indigo Rare/Accidental
6 Dickcissel Spiza americana Dickcissel d'Amérique Rare/Accidental
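To get from that frame to the layout asked for in the question, one option is to treat the rows whose last column is empty as group headers and forward-fill them down onto the bird rows. This is only a sketch: the frame is rebuilt by hand from the output above so that it runs without parsing any HTML, and both the column name `group` and the rule "a header row has NaN in column 3" are my assumptions based on this sample, not something pandas does for you.

```python
import pandas as pd

# Hand-built copy of the frame shown above (header rows have no 4th cell).
rows = [
    ["PASSERIFORMES: Cardinalidae"] * 3 + [None],
    ["Summer Tanager", "Piranga rubra", "Piranga vermillon", "Rare/Accidental"],
    ["Scarlet Tanager", "Piranga olivacea", "Piranga écarlate", "Rare/Accidental"],
    ["Rose-breasted Grosbeak", "Pheucticus ludovicianus", "Cardinal à poitrine rose", "Rare/Accidental"],
    ["PASSERIFORMES: Buntings"] * 3 + [None],
    ["Indigo Bunting", "Passerina cyanea", "Passerin indigo", "Rare/Accidental"],
    ["Dickcissel", "Spiza americana", "Dickcissel d'Amérique", "Rare/Accidental"],
]
df = pd.DataFrame(rows)

# Assumption: a row is a group header when its last column is missing.
is_header = df[3].isna()

# Keep the header text only on header rows, then carry it down to the rows below.
df.insert(0, "group", df[0].where(is_header).ffill())

# Drop the header rows themselves; what remains is one line per bird.
birds = df[~is_header].reset_index(drop=True)

print(birds.to_csv(index=False, header=False))
```

If the real page has bird rows with an empty status cell, the NaN test would misclassify them, so a different marker for header rows would be needed.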
If you want a solution without pandas, you can use this script:
from bs4 import BeautifulSoup
txt = '''
<table>
<tr valign="bottom"><td colspan="3"><p> <br/><b>PASSERIFORMES: Cardinalidae</b></p></td></tr>
<tr class="highlight1"><td>Summer Tanager</td><td><a href="species.jsp?avibaseid=891798D9EFFE1F8D"><i>Piranga rubra</i></a></td><td>Piranga vermillon</td><td>Rare/Accidental </td></tr>
<tr class="highlight1"><td>Scarlet Tanager</td><td><a href="species.jsp?avibaseid=4210163221C2E458"><i>Piranga olivacea</i></a></td><td>Piranga écarlate</td><td>Rare/Accidental </td></tr>
<tr class="highlight1"><td>Rose-breasted Grosbeak</td><td><a href="species.jsp?avibaseid=7C2FCB13BAA660EE"><i>Pheucticus ludovicianus</i></a></td><td>Cardinal à poitrine rose</td><td>Rare/Accidental </td></tr>
<tr valign="bottom"><td colspan="3"><p> <br/><b>PASSERIFORMES: Buntings</b></p></td></tr>
<tr class="highlight1"><td>Indigo Bunting</td><td><a href="species.jsp?avibaseid=043F337AA25E7D97"><i>Passerina cyanea</i></a></td><td>Passerin indigo</td><td>Rare/Accidental </td></tr>
<tr class="highlight1"><td>Dickcissel</td><td><a href="species.jsp?avibaseid=592E58CE67D092DA"><i>Spiza americana</i></a></td><td>Dickcissel d'Amérique</td><td>Rare/Accidental </td></tr>
</table>'''
soup = BeautifulSoup(txt, 'html.parser')
all_data = []
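# Every row that does not contain a header cell (td with colspan) is a bird row
for tr in soup.select('tr:not(:has(td[colspan]))'):
    all_data.append([
        # Walk backwards in the document to the nearest header cell
        tr.find_previous('td', {'colspan': True}).get_text(strip=True),
        *[td.get_text(strip=True) for td in tr.select('td')]
    ])
# print data to screen:
for row in all_data:
    print(*row, sep=', ')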
Prints:
PASSERIFORMES: Cardinalidae, Summer Tanager, Piranga rubra, Piranga vermillon, Rare/Accidental
PASSERIFORMES: Cardinalidae, Scarlet Tanager, Piranga olivacea, Piranga écarlate, Rare/Accidental
PASSERIFORMES: Cardinalidae, Rose-breasted Grosbeak, Pheucticus ludovicianus, Cardinal à poitrine rose, Rare/Accidental
PASSERIFORMES: Buntings, Indigo Bunting, Passerina cyanea, Passerin indigo, Rare/Accidental
PASSERIFORMES: Buntings, Dickcissel, Spiza americana, Dickcissel d'Amérique, Rare/Accidental
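Since the question asked for CSV output and also collected the links, here is a rough variation that selects the header row by its valign attribute with `find_previous_sibling` and writes through `csv.writer`. Treat it as a sketch under a few assumptions: the helper name `bird_rows` is made up for illustration, the sample is shortened to one bird per group, and the rows go to an in-memory buffer so the example has no side effects.

```python
import csv
import io

from bs4 import BeautifulSoup

html = '''
<table>
<tr valign="bottom"><td colspan="3"><p> <br/><b>PASSERIFORMES: Cardinalidae</b></p></td></tr>
<tr class="highlight1"><td>Summer Tanager</td><td><a href="species.jsp?avibaseid=891798D9EFFE1F8D"><i>Piranga rubra</i></a></td><td>Piranga vermillon</td><td>Rare/Accidental </td></tr>
<tr valign="bottom"><td colspan="3"><p> <br/><b>PASSERIFORMES: Buntings</b></p></td></tr>
<tr class="highlight1"><td>Indigo Bunting</td><td><a href="species.jsp?avibaseid=043F337AA25E7D97"><i>Passerina cyanea</i></a></td><td>Passerin indigo</td><td>Rare/Accidental </td></tr>
</table>'''


def bird_rows(markup):
    """Yield [group, cells..., link] for each highlighted row (hypothetical helper)."""
    soup = BeautifulSoup(markup, 'html.parser')
    for tr in soup.select('tr.highlight1'):
        # Nearest earlier sibling row that carries valign="bottom"
        header = tr.find_previous_sibling('tr', attrs={'valign': 'bottom'})
        group = header.get_text(strip=True) if header else ''
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        link = tr.find('a', href=True)
        yield [group, *cells, link['href'] if link else '']


# Written to a string buffer here; a real file would work the same way.
buffer = io.StringIO()
csv.writer(buffer).writerows(bird_rows(html))
print(buffer.getvalue())
```

To write a file instead, replace the buffer with `open('birdfile.csv', 'w', newline='', encoding='utf-8')` inside a `with` block. `find_previous_sibling` only looks at rows under the same parent, which fits this table; `find_previous` as used above also crosses parent boundaries.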