
Scraping text from an element whose XPath keeps changing using Selenium and Python

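I am trying to scrape information from a list of different web pages. I am able to scrape the list from the website and iterate over it without trouble. Where I run into trouble is extracting some text that may or may not be present on each page. Initially I used an XPath, and it worked at first. But then the XPath changed. I thought I had solved that, but then I found the same information under yet another XPath. Now I don't think XPath will work the way I am trying to use it. Below are three examples that all look similar but have three different XPaths.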

<h4 style="margin-bottom: 5px;">What our panel thought</h4>
<p>
 "Light, well-rounded peach, kiwi, barnyard Brett, overly ripe pineapple. Clean lactic acidity, balanced, with restrained funk; lemony and floral; medium body allows acidity to cut through and finish medium-dry. Herbal flavor through the finish, notes of white wine."
</p>

Xpath:
//*[@id="article-body"]/div[3]/p[2]/text()

Selenium:
driver.find_element_by_xpath('//*[@id="article-body"]/div[3]/p[2]').text

<h4 style="margin-bottom: 5px;">What our panel thought</h4>
<p>
"The appearance of this beer begs the name to have the word ‘cloud’ in it. Deep golden haze with a billowy head. Wonderful nose with a blend of citrus and tropical fruits. Compelling flavor profile filled with a blend of orange, peach, pineapple, and guava. Soft pillowy body with a more assertive finish that brings some bitterness to the table to scrub the palate for another sip. Slight hops burn. Pretty awesome beer for which we would gladly regularly reserve a spot in our fridge."
</p>

Xpath:
//*[@id="article-body"]/div[2]/p[2]/text()

Selenium:
driver.find_element_by_xpath('//*[@id="article-body"]/div[2]/p[2]').text

<h4 style="margin-bottom: 5px;">What our panel thought</h4>
<p>
  <strong>Aroma:</strong>
   “Pumpkin notes and a touch of caramel malt with some clove, cinnamon, and nutmeg. This smells like a pumpkin-beer pie: crust, spice, warm, and some malt to make you think beer.”
</p>
<p>
  <strong>Flavor:</strong>
   “Where the nose was fairly mild, the flavor is much more interesting—a rich malt sweetness up front buffers the clove, nutmeg, and cinnamon. Notes of caramel and toffee with a bit of brown sugar, pumpkin, ginger, and vanilla. Hops bitterness balances nicely. More drinkable than one might expect—it’s not a big and heavy fall seasonal. Toasty crust lingers, reminds of pie. Finishes a bit sweet but nice for the style.”
</p>
<p>
  <strong>Overall:</strong>
   “Well-crafted pumpkin beer with a nice malt base and a compelling blend of spices. The spicing is bold but balanced. The spices and malt complexity are a delight. Everything works together to make a classic pumpkin beer.”
</p>

Xpaths:
//*[@id="article-body"]/div[3]/p[2]/text()
//*[@id="article-body"]/div[3]/p[3]/text()
//*[@id="article-body"]/div[3]/p[4]/text()

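The first two cases are easy to handle with try/except. The last one is what really gives me trouble, because it is split across three different <p> tags. What I want is all of the text that follows <h4 style="margin-bottom: 5px;">What our panel thought</h4>. I would also like to be able to put all of the text into a list, like this: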

['Light, well-rounded peach, kiwi, barnyard Brett, overly ripe pineapple. Clean lactic acidity, balanced, with restrained funk; lemony and floral; medium body allows acidity to cut through and finish medium-dry. Herbal flavor through the finish, notes of white wine.', 
'The appearance of this beer begs the name to have the word ‘cloud’ in it. Deep golden haze with a billowy head. Wonderful nose with a blend of citrus and tropical fruits. Compelling flavor profile filled with a blend of orange, peach, pineapple, and guava. Soft pillowy body with a more assertive finish that brings some bitterness to the table to scrub the palate for another sip. Slight hops burn. Pretty awesome beer for which we would gladly regularly reserve a spot in our fridge.', 
'Pumpkin notes and a touch of caramel malt with some clove, cinnamon, and nutmeg. This smells like a pumpkin-beer pie: crust, spice, warm, and some malt to make you think beer. Where the nose was fairly mild, the flavor is much more interesting—a rich malt sweetness up front buffers the clove, nutmeg, and cinnamon. Notes of caramel and toffee with a bit of brown sugar, pumpkin, ginger, and vanilla. Hops bitterness balances nicely. More drinkable than one might expect—it’s not a big and heavy fall seasonal. Toasty crust lingers, reminds of pie. Finishes a bit sweet but nice for the style. Well-crafted pumpkin beer with a nice malt base and a compelling blend of spices. The spicing is bold but balanced. The spices and malt complexity are a delight. Everything works together to make a classic pumpkin beer.']
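
One way to get that shape without depending on element positions is to walk the page source and keep only the <p> text that sits under the wanted heading. The following is a minimal sketch, not a definitive implementation: it swaps XPath for Python's standard-library html.parser, the names PanelTextParser and panel_text_from_html are made up for illustration, and it assumes the heading is closed with a proper </h4> and that the section ends at the next <h4>. With Selenium, the input string would presumably be driver.page_source.

```python
from html.parser import HTMLParser


class PanelTextParser(HTMLParser):
    """Collect the text of <p> tags that follow a given <h4> heading."""

    def __init__(self, heading="What our panel thought"):
        super().__init__()
        self.heading = heading
        self.in_h4 = False      # currently inside an <h4>
        self.h4_text = []       # text pieces of the current <h4>
        self.active = False     # True while under the wanted heading
        self.in_p = False       # inside a <p> that belongs to the section
        self.skip_depth = 0     # > 0 while inside a <strong> label
        self.parts = []         # collected paragraph texts

    def handle_starttag(self, tag, attrs):
        if tag == "h4":
            # Any new heading ends the previous section.
            self.in_h4 = True
            self.h4_text = []
            self.active = False
        elif tag == "p" and self.active:
            self.in_p = True
        elif tag == "strong" and self.in_p:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "h4":
            self.in_h4 = False
            self.active = self.heading in "".join(self.h4_text)
        elif tag == "p":
            self.in_p = False
        elif tag == "strong" and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_h4:
            self.h4_text.append(data)
        elif self.in_p and not self.skip_depth:
            # Drop surrounding whitespace and straight or curly double quotes.
            text = data.strip().strip('"\u201c\u201d')
            if text:
                self.parts.append(text)


def panel_text_from_html(html, heading="What our panel thought"):
    """Return the section's paragraphs joined into one string ('' if absent)."""
    parser = PanelTextParser(heading)
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Calling panel_text_from_html once per page and appending each result to a list would give one string per page. A page without the heading yields an empty string, so no try/except is needed for the missing case. Skipping <strong> is what drops labels such as "Aroma:", which matches the desired output above.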

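I am guessing I can't use XPath, but I am fairly new to web scraping with Selenium, so I am not sure what the best course of action is from here. Any advice would be appreciated.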

Edit: I should add that under //*[@id="article-body"] there are multiple <h4> tags with <p> tags. I want to get the content specifically after <h4 style="margin-bottom: 5px;">What our panel thought</h4>.

I was able to get it working using: driver.find_element_by_xpath('//h4[contains(text(),"What our panel thought")]//following-sibling::p').text
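
Note that the singular find_element call returns only the first matching <p>, so the three-paragraph case would likely come back truncated. A hedged sketch of a variant that collects every <p> belonging to the heading follows. The names PANEL_XPATH and panel_text_from_driver are hypothetical, and the XPath assumes the <p> tags are siblings of the <h4> and that the section ends at the next <h4>. It uses find_elements(by, value) because, as far as I know, the find_element_by_* helpers were dropped in newer Selenium 4 releases.

```python
# Assumption: "xpath" is the string value of selenium's By.XPATH constant,
# so this sketch does not need to import anything from selenium.
PANEL_XPATH = (
    # Start at the wanted heading ...
    '//h4[contains(text(), "What our panel thought")]'
    # ... take every <p> sibling after it ...
    '/following-sibling::p'
    # ... but only those whose nearest preceding <h4> is that same heading.
    '[preceding-sibling::h4[1][contains(text(), "What our panel thought")]]'
)


def panel_text_from_driver(driver):
    """Join the text of all matching <p> elements ('' if none are found)."""
    # find_elements returns an empty list instead of raising when nothing matches.
    paragraphs = driver.find_elements("xpath", PANEL_XPATH)
    texts = [p.text.strip() for p in paragraphs]
    return " ".join(t for t in texts if t)
```

In a loop over pages this could be used as reviews.append(panel_text_from_driver(driver)) after each driver.get(url), where reviews is an ordinary list. Unlike the desired output above, .text keeps the <strong> labels such as "Aroma:" and the surrounding quote marks, so those would still need to be stripped afterwards.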
