
python/beautifulsoup: find previous row with particular attribute

I'm scraping an HTML file that contains a table like this:

<table>
<tr valign="bottom"><td colspan="3"><p> <br/><b>PASSERIFORMES: Cardinalidae</b></p></td></tr>
<tr class="highlight1"><td>Summer Tanager</td><td><a href="species.jsp?avibaseid=891798D9EFFE1F8D"><i>Piranga rubra</i></a></td><td>Piranga vermillon</td><td>Rare/Accidental </td></tr>
<tr class="highlight1"><td>Scarlet Tanager</td><td><a href="species.jsp?avibaseid=4210163221C2E458"><i>Piranga olivacea</i></a></td><td>Piranga écarlate</td><td>Rare/Accidental </td></tr>
<tr class="highlight1"><td>Rose-breasted Grosbeak</td><td><a href="species.jsp?avibaseid=7C2FCB13BAA660EE"><i>Pheucticus ludovicianus</i></a></td><td>Cardinal à poitrine rose</td><td>Rare/Accidental </td></tr>
<tr valign="bottom"><td colspan="3"><p> <br/><b>PASSERIFORMES: Buntings</b></p></td></tr>
<tr class="highlight1"><td>Indigo Bunting</td><td><a href="species.jsp?avibaseid=043F337AA25E7D97"><i>Passerina cyanea</i></a></td><td>Passerin indigo</td><td>Rare/Accidental </td></tr>
<tr class="highlight1"><td>Dickcissel</td><td><a href="species.jsp?avibaseid=592E58CE67D092DA"><i>Spiza americana</i></a></td><td>Dickcissel d'Amérique</td><td>Rare/Accidental </td></tr>
</table>

I have no trouble getting the values from the tr class="highlight1" rows and writing them to a CSV by doing the following:

import csv
from bs4 import BeautifulSoup

soup = BeautifulSoup(open(r"/Users/user/Downloads/birds.html"), 'lxml')
# one list per column, taken only from rows that carry a class attribute
english = [item.text for item in soup.select('tr[class] td:nth-of-type(1)')]
latin = [item.text for item in soup.select('tr[class] td:nth-of-type(2)')]
french = [item.text for item in soup.select('tr[class] td:nth-of-type(3)')]
status = [item.text for item in soup.select('tr[class] td:nth-of-type(4)')]
link = [item['href'] for item in soup.select('tr[class] a[href]')]

test = zip(english,latin,french,status,link)
# newline='' stops the csv module from writing blank lines on Windows
with open('birdfile.csv', 'wt', newline='') as csvfile:
    csv_out = csv.writer(csvfile)
    csv_out.writerows(test)

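What I want to do is get, for each row, the value from the tr valign="bottom" row above it. Basically, I know how to use CSS selectors in BeautifulSoup to move forward and drill down, but I can't figure out how to go backwards and select the tr valign="bottom" that precedes each tr class="highlight1".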

I'd like my CSV output to look like this:

PASSERIFORMES: Cardinalidae,Summer Tanager,Piranga rubra...
PASSERIFORMES: Cardinalidae,Scarlet Tanager,Piranga olivacea...
PASSERIFORMES: Cardinalidae,Rose-breasted Grosbeak,Pheucticus ludovicianus...
PASSERIFORMES: Buntings,Indigo Bunting,Passerina cyanea...
PASSERIFORMES: Buntings,Dickcissel,Spiza americana...

I couldn't find any examples of this, and I'd really appreciate any help!

You can simply read your table into pandas and then slice and dice it as you see fit:

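import pandas as pd
from io import StringIO

langs = """your html above"""
# read_html returns a list of DataFrames; recent pandas wants literal HTML wrapped in StringIO
df = pd.read_html(StringIO(langs))
df[0]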

Output:

   0                            1                            2                            3
0  PASSERIFORMES: Cardinalidae  PASSERIFORMES: Cardinalidae  PASSERIFORMES: Cardinalidae  NaN
1  Summer Tanager               Piranga rubra                Piranga vermillon            Rare/Accidental
2  Scarlet Tanager              Piranga olivacea             Piranga écarlate             Rare/Accidental
3  Rose-breasted Grosbeak       Pheucticus ludovicianus      Cardinal à poitrine rose     Rare/Accidental
4  PASSERIFORMES: Buntings      PASSERIFORMES: Buntings      PASSERIFORMES: Buntings      NaN
5  Indigo Bunting               Passerina cyanea             Passerin indigo              Rare/Accidental
6  Dickcissel                   Spiza americana              Dickcissel d'Amérique        Rare/Accidental
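
To get from that frame to the layout asked for in the question, one option is to copy each heading down onto the rows beneath it and then drop the heading rows. The sketch below is only an illustration: it rebuilds a shortened copy of the frame by hand instead of calling read_html, the column name family is made up for the example, and it assumes a heading row can be recognised by its empty last column, which holds for this table but may not hold for others.

```python
import pandas as pd

# Hand-typed, shortened copy of the frame shown above (an assumption made
# so the sketch runs without an HTML parser installed).
df = pd.DataFrame([
    ["PASSERIFORMES: Cardinalidae"] * 3 + [None],
    ["Summer Tanager", "Piranga rubra", "Piranga vermillon", "Rare/Accidental"],
    ["Scarlet Tanager", "Piranga olivacea", "Piranga écarlate", "Rare/Accidental"],
    ["PASSERIFORMES: Buntings"] * 3 + [None],
    ["Indigo Bunting", "Passerina cyanea", "Passerin indigo", "Rare/Accidental"],
])

# Assumption: a heading row is one whose last column is missing,
# because its single cell spans only the first three columns.
is_heading = df[3].isna()

# Keep the heading text on heading rows only, then forward-fill it
# so every bird row inherits the heading above it.
df["family"] = df[0].where(is_heading).ffill()

# Drop the heading rows and put the new column first.
birds = df.loc[~is_heading, ["family", 0, 1, 2, 3]].reset_index(drop=True)

print(birds.to_csv(index=False, header=False))
```

Passing a file name to to_csv instead of printing would write the same rows to disk.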

If you want a solution without pandas, you can use this script:

from bs4 import BeautifulSoup


txt = '''
<table>
<tr valign="bottom"><td colspan="3"><p> <br/><b>PASSERIFORMES: Cardinalidae</b></p></td></tr>
<tr class="highlight1"><td>Summer Tanager</td><td><a href="species.jsp?avibaseid=891798D9EFFE1F8D"><i>Piranga rubra</i></a></td><td>Piranga vermillon</td><td>Rare/Accidental </td></tr>
<tr class="highlight1"><td>Scarlet Tanager</td><td><a href="species.jsp?avibaseid=4210163221C2E458"><i>Piranga olivacea</i></a></td><td>Piranga écarlate</td><td>Rare/Accidental </td></tr>
<tr class="highlight1"><td>Rose-breasted Grosbeak</td><td><a href="species.jsp?avibaseid=7C2FCB13BAA660EE"><i>Pheucticus ludovicianus</i></a></td><td>Cardinal à poitrine rose</td><td>Rare/Accidental </td></tr>
<tr valign="bottom"><td colspan="3"><p> <br/><b>PASSERIFORMES: Buntings</b></p></td></tr>
<tr class="highlight1"><td>Indigo Bunting</td><td><a href="species.jsp?avibaseid=043F337AA25E7D97"><i>Passerina cyanea</i></a></td><td>Passerin indigo</td><td>Rare/Accidental </td></tr>
<tr class="highlight1"><td>Dickcissel</td><td><a href="species.jsp?avibaseid=592E58CE67D092DA"><i>Spiza americana</i></a></td><td>Dickcissel d'Amérique</td><td>Rare/Accidental </td></tr>
</table>'''

soup = BeautifulSoup(txt, 'html.parser')

all_data = []
for tr in soup.select('tr:not(:has(td[colspan]))'):
    all_data.append([
        tr.find_previous('td', {'colspan': True}).get_text(strip=True), 
        *[td.get_text(strip=True) for td in tr.select('td')] 
    ])

# print data to screen:
for row in all_data:
    print(*row, sep=', ')

Prints:

PASSERIFORMES: Cardinalidae, Summer Tanager, Piranga rubra, Piranga vermillon, Rare/Accidental
PASSERIFORMES: Cardinalidae, Scarlet Tanager, Piranga olivacea, Piranga écarlate, Rare/Accidental
PASSERIFORMES: Cardinalidae, Rose-breasted Grosbeak, Pheucticus ludovicianus, Cardinal à poitrine rose, Rare/Accidental
PASSERIFORMES: Buntings, Indigo Bunting, Passerina cyanea, Passerin indigo, Rare/Accidental
PASSERIFORMES: Buntings, Dickcissel, Spiza americana, Dickcissel d'Amérique, Rare/Accidental
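
Since the question asks for a CSV and for the previous row specifically, here is a variant that might be closer to the original goal. It is a sketch, not the answer's own method: it swaps find_previous for find_previous_sibling, which looks backwards among the sibling tr elements for the nearest one with valign="bottom", and it also keeps the link that the question's code collected. The helper name rows_with_heading and the shortened table are assumptions made for the example, and the rows are written to an in-memory buffer rather than to a file.

```python
import csv
import io

from bs4 import BeautifulSoup

# Shortened copy of the table from the question.
html = '''<table>
<tr valign="bottom"><td colspan="3"><p> <br/><b>PASSERIFORMES: Cardinalidae</b></p></td></tr>
<tr class="highlight1"><td>Summer Tanager</td><td><a href="species.jsp?avibaseid=891798D9EFFE1F8D"><i>Piranga rubra</i></a></td><td>Piranga vermillon</td><td>Rare/Accidental </td></tr>
<tr valign="bottom"><td colspan="3"><p> <br/><b>PASSERIFORMES: Buntings</b></p></td></tr>
<tr class="highlight1"><td>Indigo Bunting</td><td><a href="species.jsp?avibaseid=043F337AA25E7D97"><i>Passerina cyanea</i></a></td><td>Passerin indigo</td><td>Rare/Accidental </td></tr>
</table>'''


def rows_with_heading(markup):
    """Return one list per bird row: heading, the four cells, then the link."""
    soup = BeautifulSoup(markup, 'html.parser')
    rows = []
    for tr in soup.select('tr.highlight1'):
        # Walk backwards through earlier siblings until a heading row is found.
        heading = tr.find_previous_sibling('tr', attrs={'valign': 'bottom'})
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        link = tr.find('a', href=True)
        rows.append([
            heading.get_text(strip=True) if heading else '',
            *cells,
            link['href'] if link else '',
        ])
    return rows


rows = rows_with_heading(html)

# Write to an in-memory buffer; a real file works the same way.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

To write to disk instead, replace the buffer with open('birdfile.csv', 'w', newline='', encoding='utf-8') inside a with block.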

