簡體   English   中英

使用.find()從帶有BS4的html頁面中提取兩個相同的'div'中的第二個

[英]Extracting with .find() the second of 2 identical 'div' from html page with BS4

我正在嘗試從aa湯元素中提取2個相同的“ div”中的第二個。 解析槽並使用.find()方法提取時,它排在最前面。 如果滿足某些條件,如何告訴腳本跳過第一個腳本並獲取下一個腳本? 下面是我要從中提取的html代碼。

<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>
</div>
</div></div>
<div class="sg-1"><div class="sg-2">
<div class="a-section a-spacing-none a-spacing-top-small">
<div class="a-row a-size-base a-color-base">
</div>
</div>
<div class="a-section a-spacing-none a-spacing-top-mini">
<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div>
</div>

這是我正在嘗試的代碼:

if '$' not in str(product.find('div', {'class': 'a-row a-size-base a-color-secondary'})):
    print('NOT IN')
    pass
    price = product.find('div', {'class': 'a-row a-size-base a-color-secondary'})
    print(price)
else:
    price = product.find('div', {'class': 'a-row a-size-base a-color-secondary'})
    print(price)

但是,結果仍然給了我這個:

NOT IN
<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>

而不是這樣:

<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div> 

有什么建議么?

您需要find_all然后索引到返回列表中,因為find僅返回第一個匹配項。 您可以使用select做同樣的事情。 使用bs4 4.7.1。 你可以使用:contains目標innerText由子(例如,元素的CONtv trial ),然后使用select_one如果第一場比賽想要的或select ,如果多個匹配。 您想先嘗試測試if None然后再嘗試訪問.text

from bs4 import BeautifulSoup as bs
import requests

html = '''
<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>
</div>
</div></div>
<div class="sg-1"><div class="sg-2">
<div class="a-section a-spacing-none a-spacing-top-small">
<div class="a-row a-size-base a-color-base">
</div>
</div>
<div class="a-section a-spacing-none a-spacing-top-mini">
<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div>
</div>
'''
soup = bs(html, 'lxml')
print(soup.find_all('div', {'class': 'a-row a-size-base a-color-secondary'})[1].text)
print(soup.select('.a-color-secondary')[1].text)
print(soup.select_one('.a-color-secondary:contains("CONtv trial")').text)

用find_all循環

matches = soup.find_all('div', {'class': 'a-row a-size-base a-color-secondary'})
for item in matches:
    if '$' in str(item):
        print(item.text)

假設div現在直接位於<body>下,則可以使用標准的Python索引。 在您的真實代碼中,將選擇器中的body替換為適當的元素:

data = '''<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>
</div>
</div></div>
<div class="sg-1"><div class="sg-2">
<div class="a-section a-spacing-none a-spacing-top-small">
<div class="a-row a-size-base a-color-base">
</div>
</div>
<div class="a-section a-spacing-none a-spacing-top-mini">
<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div>
</div>'''

from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(data, 'lxml')

print(soup.select('body > div')[1].text.strip())

打印:

$0.00 with a CONtv trial on Prime Video Channels

注意select()>符號,這意味着我們希望所有<div> 直接<body>

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM