Scraping specific tag and keyword, printing info associated with it using BeautifulSoup
I'm trying to scrape https://store.fabspy.com/collections/new-arrivals-beauty for the sapphire eye pencil product and return the info associated with the product's id. So far I have:
from bs4 import BeautifulSoup
import urllib2
url = 'https://store.fabspy.com/collections/new-arrivals-beauty'
page = BeautifulSoup(url.read())
soup = BeautifulSoup((page))
tag = 'div class="product-content"'
if row in soup.html.body.findAll(tag):
    data = row.findAll('id')
    if data and 'sapphire' in data[0].text:
        print data[4].text
The information I am trying to retrieve is the following:
<div class="product-content">
<div class="pc-inner">
<div data-handle="clematis-dewdrop-sparkling-eye-pencil-g7454c-sapphire"
data-target="#quick-shop-popup"
class="quick_shop quick-shop-button"
data-toggle="modal"
title="Quick View">
<span>+ Quick View</span>
<span class="json hide">
{
"id":8779050374,
"title":"Clematis - Dewdrop Sparkling Gel Eye Liner Pencil # G7454C**Sapphire**",
"handle":"clematis-dewdrop-sparkling-eye-pencil-g7454c-sapphire",
"description":"\u003cdiv\u003e\r\n\r\nGel Formula, Rich Colour, Matte Finish, Long-Wearing, Safe for Waterline\r\n\r\n\u003cbr\u003e\n\u003c\/div\u003e\u003cdiv\u003e\u003cbr\u003e\u003c\/div\u003e \u003cimg alt=\"\" src=\"\/\/i.imgur.com\/adW5MKl.jpg\"\u003e",
"published_at":"2016-10-17T20:15:40+08:00",
"created_at":"2016-10-17T20:15:40+08:00",
"vendor":"Clematis",
"type":"Latest,Beauty,New,Makeup,Best, Clematis, Eyes",
"tags":["Beauty","Best","Clematis","Eyes","Latest","Makeup","New"],
"price":4900,
"price_min":4900,
"price_max":4900,
"available":true,
"price_varies":false,
"compare_at_price":7900,
"compare_at_price_min":7900,
"compare_at_price_max":7900,
"compare_at_price_varies":false,
"variants":[{"id":31447937030", "title":"N\/A"]
}
Specifically the id at the end. Please specify which tag my script should focus on to retrieve this info, and how I can search by keyword for the sapphire color and its id within the script. Thanks!
There are some errors in the existing code: url is a plain string, so url.read() fails; 'div class="product-content"' is not a tag name that findAll can match; and the loop needs for rather than if. I recommend using requests instead of urllib2. I'm also using the re and json libraries. This is what I would do in your case (read the comments in the code for explanations).
from bs4 import BeautifulSoup
import requests
import re
import json

# URL to scrape
url = 'https://store.fabspy.com/collections/new-arrivals-beauty'

# Fetch the page; raise_for_status() raises on 404 and other HTTP errors
response = requests.get(url)
response.raise_for_status()
soup = BeautifulSoup(response.text, "lxml")

# Get a list of all elements having `sapphire` in the `data-handle` attribute
sapphire = soup.find_all(attrs={'data-handle': re.compile(r".*sapphire.*")})

# Take the first element of this list (when I checked, there was just one)
sapphire = sapphire[0]

# Find the element with class `json` inside it and take the first match's text
json_text = sapphire.find_all(attrs={'class': "json"})[0].text

# Convert it to a dictionary and print the top-level product id
data = json.loads(json_text)
print(data["id"])
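The question asks for the id "at the end", which looks like the id inside the variants list rather than the top-level product id printed above. Here is a minimal sketch of how that could be read once the JSON text has been extracted. It assumes the JSON has the shape shown in the question; the sample string is a trimmed copy of that excerpt, and matches_keyword and variant_ids are hypothetical helper names I made up for illustration, not part of any library. It is written for Python 3.

```python
import json

# Trimmed sample adapted from the JSON in the question; only the fields
# used below are kept. On the real page this text would come from the
# element with class "json".
json_text = """
{
  "id": 8779050374,
  "title": "Clematis - Dewdrop Sparkling Gel Eye Liner Pencil # G7454C Sapphire",
  "handle": "clematis-dewdrop-sparkling-eye-pencil-g7454c-sapphire",
  "variants": [{"id": 31447937030, "title": "N/A"}]
}
"""


def matches_keyword(product, keyword):
    """Return True if keyword occurs in the handle or title, ignoring case."""
    keyword = keyword.lower()
    handle = product.get("handle", "").lower()
    title = product.get("title", "").lower()
    return keyword in handle or keyword in title


def variant_ids(product):
    """Return the id of every entry in the product's "variants" list."""
    return [variant["id"] for variant in product.get("variants", [])]


data = json.loads(json_text)

if matches_keyword(data, "sapphire"):
    print(data["id"])         # top-level product id
    print(variant_ids(data))  # ids taken from the "variants" list
```

Checking the keyword against the parsed handle and title, instead of only the data-handle attribute, is one possible design choice; it keeps the search working if the color appears in the title but not in the handle.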