
How do I grab the string on the next line in HTML code following a <span> tag with a specific class and specific text?

I'm trying to scrape product specifications from an e-commerce website. I have a list of URLs for various products; my code needs to visit each one (this part is easy) and scrape the product specs I need. I have been trying to use ParseHub. It works for some links but not for others. My suspicion is that a spec such as 'Wheel Diameter' changes its position from page to page, so ParseHub ends up grabbing the wrong spec value.

One such part of the HTML looks like this:

<div class="product-detail product-detail-custom-field">
          <span class="product-detail-key">Wheel Diameter</span>
          <span data-product-custom-field="">8 Inches</span>
        </div>

What I think I could do is use BeautifulSoup with something like:

if soup.find("span", class_ = "product-detail-key").text.strip()=="Wheel Diameter":
                *go to the next line and grab the string inside*

How can I code this? I apologize if my question sounds silly; I'm pretty new to web scraping.

You can use the .find_next() method:

from bs4 import BeautifulSoup

html_doc = """
<div class="product-detail product-detail-custom-field">
  <span class="product-detail-key">Wheel Diameter</span>
  <span data-product-custom-field="">8 Inches</span>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# string= is the current name of the older text= argument
diameter = soup.find("span", string="Wheel Diameter").find_next("span").text
print(diameter)

Prints:

8 Inches
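
Because the question suspects that each spec changes position from page to page, one option is to collect every key/value pair into a dictionary and look values up by name instead of by position. This is only a minimal sketch: it assumes every spec uses the same markup as the snippet above (a key span immediately followed by a value span), and the second "Color" block and the `specs` name are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment: the "Color" block is invented for illustration.
html_doc = """
<div class="product-detail product-detail-custom-field">
  <span class="product-detail-key">Wheel Diameter</span>
  <span data-product-custom-field="">8 Inches</span>
</div>
<div class="product-detail product-detail-custom-field">
  <span class="product-detail-key">Color</span>
  <span data-product-custom-field="">Black</span>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

specs = {}
for key in soup.find_all("span", class_="product-detail-key"):
    # find_next_sibling skips whitespace text nodes and returns the next <span> sibling
    value = key.find_next_sibling("span")
    if value is not None:
        specs[key.get_text(strip=True)] = value.get_text(strip=True)

print(specs.get("Wheel Diameter"))
```

Looking specs up with `specs.get(...)` returns None for a missing key, so a product page without a given spec does not raise an error.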

Or use a CSS selector with the + combinator:

diameter = soup.select_one('.product-detail-key:-soup-contains("Wheel Diameter") + *').text

With CSS selectors you can chain and combine conditions to make the selection stricter. In this case you select the <span> that contains your string and use the adjacent sibling combinator (+) to get the <span> immediately after it.

diameter = soup.select_one('.product-detail-key:-soup-contains("Wheel Diameter") + span').text

or

diameter = soup.select_one('span.product-detail-key:-soup-contains("Wheel Diameter") + span').text

Note: If the element is not found, select_one() returns None, and reading .text raises AttributeError: 'NoneType' object has no attribute 'text'. To avoid this, check that the element exists before accessing the text attribute:

diameter = e.text if (e := soup.select_one('.product-detail-key:-soup-contains("Wheel Diameter") + span')) else None

Example

from bs4 import BeautifulSoup

html_doc = """
<div class="product-detail product-detail-custom-field">
  <span class="product-detail-key">Wheel Diameter</span>
  <span data-product-custom-field="">8 Inches</span>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

diameter = e.text if (e := soup.select_one('.product-detail-key:-soup-contains("Wheel Diameter") + span')) else None
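
When several specs are needed per page, the same None check can be wrapped in a small helper. The sketch below is one possible shape, not part of the original answer: the function name `get_spec` and the "Battery Capacity" label are hypothetical, and it assumes labels contain no double quotes, since the label is interpolated directly into the selector.

```python
from bs4 import BeautifulSoup

def get_spec(soup, label):
    """Return the text of the <span> right after the key span containing `label`, or None."""
    # Assumption: `label` contains no double quotes, since it is placed inside the selector.
    selector = f'span.product-detail-key:-soup-contains("{label}") + span'
    element = soup.select_one(selector)
    return element.get_text(strip=True) if element is not None else None

html_doc = """
<div class="product-detail product-detail-custom-field">
  <span class="product-detail-key">Wheel Diameter</span>
  <span data-product-custom-field="">8 Inches</span>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

print(get_spec(soup, "Wheel Diameter"))    # spec present on the page
print(get_spec(soup, "Battery Capacity"))  # hypothetical spec that is absent, gives None
```

Note that `:-soup-contains` matches substrings, so a label like "Wheel Diameter" would also match a key such as "Rear Wheel Diameter" if a page had one.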

If you are using ParseHub to collect the data:

<div class="product-detail product-detail-custom-field">
      <span class="product-detail-key">Wheel Diameter</span>
      <span data-product-custom-field="">8 Inches</span>
    </div>

and you are after the innerText of

      <span data-product-custom-field="">8 Inches</span>

Then what I would do is use a CSS selector that targets the class of the first span. Place a '+' right after it and it will select the next sibling element.

Such as:

.product-detail-key +

Your result:

<span data-product-custom-field="">8 Inches</span>

Then all you have to do is choose to export the inner text, so under export type:

$e.text

This will scrape the following:

8 Inches
