How do I grab the string on the next line in HTML code following a <span> tag with a specific class and specific text?
I'm trying to scrape product specifications from an e-commerce website. I have a list of URLs for various products; my code needs to visit each one (this part is easy) and extract the product specs I need. I have been trying to use ParseHub, which works for some links but not for others. My suspicion is that a spec such as 'Wheel Diameter' changes its position from page to page, so the wrong spec value ends up being grabbed.
One such part of the HTML looks like this:
<div class="product-detail product-detail-custom-field">
<span class="product-detail-key">Wheel Diameter</span>
<span data-product-custom-field="">8 Inches</span>
</div>
What I think I could do is use BeautifulSoup with something like:
if soup.find("span", class_ = "product-detail-key").text.strip()=="Wheel Diameter":
*go to the next line and grab the string inside*
How can I code this? I apologize if my question sounds silly; I'm pretty new to web scraping.
You can use the .find_next() function:
from bs4 import BeautifulSoup
html_doc = """
<div class="product-detail product-detail-custom-field">
<span class="product-detail-key">Wheel Diameter</span>
<span data-product-custom-field="">8 Inches</span>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
# `string=` is the current name of the argument formerly called `text=`
diameter = soup.find("span", string="Wheel Diameter").find_next("span").text
print(diameter)
Prints:
8 Inches
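Since the question mentions that a spec can sit at a different position on each page, one option is to read every key/value pair into a dictionary and look the spec up by name. The following is only a minimal sketch: it assumes every spec uses the same two-span layout as the snippet above, and the second "Frame Material" block is made up for illustration.

```python
from bs4 import BeautifulSoup

# The second product-detail block is hypothetical, added only to show
# that the lookup does not depend on where a spec appears on the page.
html_doc = """
<div class="product-detail product-detail-custom-field">
<span class="product-detail-key">Frame Material</span>
<span data-product-custom-field="">Aluminium</span>
</div>
<div class="product-detail product-detail-custom-field">
<span class="product-detail-key">Wheel Diameter</span>
<span data-product-custom-field="">8 Inches</span>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

specs = {}
for key in soup.find_all("span", class_="product-detail-key"):
    # find_next_sibling skips the whitespace between the two spans
    value = key.find_next_sibling("span")
    if value is not None:
        specs[key.get_text(strip=True)] = value.get_text(strip=True)

print(specs.get("Wheel Diameter"))
```

Using specs.get(...) returns None for a spec that is missing from a page instead of raising an error.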
Or using a CSS selector with the + combinator:
diameter = soup.select_one('.product-detail-key:-soup-contains("Wheel Diameter") + *').text
Using CSS selectors, you can chain and combine your selection to make it stricter. In this case you select the <span> that contains your string and use the adjacent sibling combinator (+) to get the next sibling <span>:
diameter = soup.select_one('.product-detail-key:-soup-contains("Wheel Diameter") + span').text
or
diameter = soup.select_one('span.product-detail-key:-soup-contains("Wheel Diameter") + span').text
Note: To avoid AttributeError: 'NoneType' object has no attribute 'text' when the element is not present, check that it exists before accessing its text attribute:
diameter = e.text if (e := soup.select_one('.product-detail-key:-soup-contains("Wheel Diameter") + span')) else None
from bs4 import BeautifulSoup
html_doc = """
<div class="product-detail product-detail-custom-field">
<span class="product-detail-key">Wheel Diameter</span>
<span data-product-custom-field="">8 Inches</span>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
diameter = e.text if (e := soup.select_one('.product-detail-key:-soup-contains("Wheel Diameter") + span')) else None
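If several specs are needed from each page, the same selector can be wrapped in a small helper. This is only a sketch: get_spec is a hypothetical name, not part of BeautifulSoup, and it assumes the label contains no double quotes and that a soupsieve version supporting :-soup-contains is installed. Note that :-soup-contains matches substrings, so "Wheel" would also match "Wheel Diameter".

```python
from bs4 import BeautifulSoup


def get_spec(soup, label):
    """Return the text of the span right after the key span containing `label`, or None."""
    # Hypothetical helper: builds the selector used above for an arbitrary label.
    selector = f'span.product-detail-key:-soup-contains("{label}") + span'
    element = soup.select_one(selector)
    return element.get_text(strip=True) if element is not None else None


html_doc = """
<div class="product-detail product-detail-custom-field">
<span class="product-detail-key">Wheel Diameter</span>
<span data-product-custom-field="">8 Inches</span>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")
print(get_spec(soup, "Wheel Diameter"))  # spec present in the snippet
print(get_spec(soup, "Weight"))          # spec absent, so None is returned
```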
If you are using ParseHub to collect the data:
<div class="product-detail product-detail-custom-field">
<span class="product-detail-key">Wheel Diameter</span>
<span data-product-custom-field="">8 Inches</span>
</div>
and you are after the inner text of
<span data-product-custom-field="">8 Inches</span>
Then what I would do is use a CSS selector that selects the class of the first span. Place a '+' right after it and it will select the next sibling element,
such as:
.product-detail-key +
Your result:
<span data-product-custom-field="">8 Inches</span>
Then all you have to do is choose to export the inner text, so under export type:
$e.text
This will scrape the following:
8 Inches