
Scraping text from an element whose XPath keeps changing, using Selenium and Python

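I'm trying to scrape information from a list of different web pages. I can scrape the list from the site and loop through it without trouble. Where I'm stuck is extracting some text that may or may not be present on each page. Initially I used an XPath, and it worked at first. But then the XPath changed. I thought I had solved that, but then I found the same information under yet another XPath. Now I don't think an XPath will work the way I'm trying to use it. Below are three examples that all look similar but have three different XPaths.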

<h4 style="margin-bottom: 5px;">What our panel thought</h4>
<p>
 ""Light, well-rounded peach, kiwi, barnyard Brett, overly ripe pineapple. Clean lactic acidity, balanced, with restrained funk; lemony and floral; medium body allows acidity to cut through and finish medium-dry. Herbal flavor through the finish, notes of white wine.""
</p>

Xpath:
//*[@id="article-body"]/div[3]/p[2]/text()

Selenium:
driver.find_element_by_xpath('//*[@id="article-body"]/div[3]/p[2]').text

<h4 style="margin-bottom: 5px;">What our panel thought</h4>
<p>
"The appearance of this beer begs the name to have the word ‘cloud’ in it. Deep golden haze with a billowy head. Wonderful nose with a blend of citrus and tropical fruits. Compelling flavor profile filled with a blend of orange, peach, pineapple, and guava. Soft pillowy body with a more assertive finish that brings some bitterness to the table to scrub the palate for another sip. Slight hops burn. Pretty awesome beer for which we would gladly regularly reserve a spot in our fridge."
</p>

Xpath:
//*[@id="article-body"]/div[2]/p[2]/text()

Selenium:
driver.find_element_by_xpath('//*[@id="article-body"]/div[2]/p[2]').text

<h4 style="margin-bottom: 5px;">What our panel thought</h4>
<p>
  <strong>Aroma:</strong>
   “Pumpkin notes and a touch of caramel malt with some clove, cinnamon, and nutmeg. This smells like a pumpkin-beer pie: crust, spice, warm, and some malt to make you think beer.”
</p>
<p>
  <strong>Flavor:</strong>
   “Where the nose was fairly mild, the flavor is much more interesting—a rich malt sweetness up front buffers the clove, nutmeg, and cinnamon. Notes of caramel and toffee with a bit of brown sugar, pumpkin, ginger, and vanilla. Hops bitterness balances nicely. More drinkable than one might expect—it’s not a big and heavy fall seasonal. Toasty crust lingers, reminds of pie. Finishes a bit sweet but nice for the style.”
</p>
<p>
  <strong>Overall:</strong>
   “Well-crafted pumpkin beer with a nice malt base and a compelling blend of spices. The spicing is bold but balanced. The spices and malt complexity are a delight. Everything works together to make a classic pumpkin beer.”
</p>

Xpaths:
//*[@id="article-body"]/div[3]/p[2]/text()
//*[@id="article-body"]/div[3]/p[3]/text()
//*[@id="article-body"]/div[3]/p[4]/text()

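The first two cases are easy to handle with try/except. The last one is really giving me trouble because it is split across three separate <p> tags. What I want is all of the text that follows <h4 style="margin-bottom: 5px;">What our panel thought</h4>. I'd also like to be able to put all of the text in one list, like this: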

['Light, well-rounded peach, kiwi, barnyard Brett, overly ripe pineapple. Clean lactic acidity, balanced, with restrained funk; lemony and floral; medium body allows acidity to cut through and finish medium-dry. Herbal flavor through the finish, notes of white wine.', 
'The appearance of this beer begs the name to have the word ‘cloud’ in it. Deep golden haze with a billowy head. Wonderful nose with a blend of citrus and tropical fruits. Compelling flavor profile filled with a blend of orange, peach, pineapple, and guava. Soft pillowy body with a more assertive finish that brings some bitterness to the table to scrub the palate for another sip. Slight hops burn. Pretty awesome beer for which we would gladly regularly reserve a spot in our fridge.', 
'Pumpkin notes and a touch of caramel malt with some clove, cinnamon, and nutmeg. This smells like a pumpkin-beer pie: crust, spice, warm, and some malt to make you think beer. Where the nose was fairly mild, the flavor is much more interesting—a rich malt sweetness up front buffers the clove, nutmeg, and cinnamon. Notes of caramel and toffee with a bit of brown sugar, pumpkin, ginger, and vanilla. Hops bitterness balances nicely. More drinkable than one might expect—it’s not a big and heavy fall seasonal. Toasty crust lingers, reminds of pie. Finishes a bit sweet but nice for the style. Well-crafted pumpkin beer with a nice malt base and a compelling blend of spices. The spicing is bold but balanced. The spices and malt complexity are a delight. Everything works together to make a classic pumpkin beer.']
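
The list above drops the "Aroma:", "Flavor:" and "Overall:" labels and the surrounding quotation marks, and joins the three paragraphs of the last example into one string. A minimal sketch of that clean-up step follows. The helper names `clean_paragraph` and `combine_paragraphs` are hypothetical, and the sketch assumes each label is a single word followed by a colon and that only double quotes (straight or curly) wrap the text.

```python
import re

# Straight and curly double quotes that wrap the panel quotes in the examples.
_QUOTES = '"\u201c\u201d'

# Assumption: a label is one word followed by a colon, e.g. "Aroma:".
_LABEL = re.compile(r"^[A-Za-z]+:\s*")


def clean_paragraph(text):
    """Strip a leading one-word label and any surrounding double quotes."""
    text = text.strip()
    text = _LABEL.sub("", text)
    return text.strip(_QUOTES + " ")


def combine_paragraphs(paragraphs):
    """Clean each paragraph and join the non-empty ones with a space."""
    cleaned = [clean_paragraph(p) for p in paragraphs]
    return " ".join(c for c in cleaned if c)
```

The input here is the list of paragraph strings for one page, for example the `.text` values of the matched <p> elements. A paragraph that legitimately begins with a word and a colon would lose that word, so the label pattern may need tightening for other pages.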

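I'm guessing I can't use an XPath, but I'm fairly new to web scraping with Selenium, so I'm not sure what the best course of action is from here. Any suggestions would be greatly appreciated.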

Edit: I should add that there are multiple <h4> tags with <p> tags under //*[@id="article-body"]. I specifically want the content that comes after <h4 style="margin-bottom: 5px;">What our panel thought</h4>.

I was able to use: driver.find_element_by_xpath('//h4[contains(text(),"What our panel thought")]//following-sibling::p').text
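
One caveat with the expression above: find_element returns only the first matching <p>, so on the three-paragraph page it would yield just the "Aroma" paragraph. The sketch below is one possible way to collect every <p> between the target heading and the next <h4>, assuming the heading and its paragraphs are siblings under the same parent. The names `PANEL_XPATH`, `panel_text` and `scrape_panel_texts` are hypothetical. It uses the `find_elements(by, value)` form because recent Selenium 4 releases no longer provide the `find_element_by_*` helpers; the string "xpath" is the value of `By.XPATH`.

```python
PANEL_HEADING = "What our panel thought"

# <p> siblings after the target <h4> whose nearest preceding <h4> sibling is
# that same heading, so paragraphs under later headings are excluded.
# Assumption: the <h4> and its <p> tags share one parent element.
PANEL_XPATH = (
    '//h4[normalize-space()="{h}"]'
    '/following-sibling::p[preceding-sibling::h4[1][normalize-space()="{h}"]]'
).format(h=PANEL_HEADING)


def panel_text(driver):
    """Return the panel notes on the current page as one string, or None."""
    # find_elements returns an empty list when nothing matches, so pages
    # without the section need no try/except.
    paragraphs = driver.find_elements("xpath", PANEL_XPATH)
    parts = [p.text.strip() for p in paragraphs if p.text.strip()]
    return " ".join(parts) if parts else None


def scrape_panel_texts(driver, urls):
    """Visit each URL and collect the panel text of pages that have one."""
    results = []
    for url in urls:
        driver.get(url)
        text = panel_text(driver)
        if text is not None:
            results.append(text)
    return results
```

With a running WebDriver instance, `scrape_panel_texts(driver, urls)` would return one string per page that contains the section. The strings still include any labels and quotation marks that appear in the page text.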

