Need to extract some information from an Amazon product page with Python 3 and BeautifulSoup
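I built a web scraper for Amazon product pages, based on code I found online. So far I have extracted some information, but I need more. I have extracted the ASIN, which is the SKU.
I need: the image, the image URL, the product name, the price, and a short description of the brand.
How can I extend my code to get this information?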
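import requests
from bs4 import BeautifulSoup as bsoup

# Fetch the product page and parse it (url is the product page URL, defined elsewhere)
resp1 = requests.get(url).content
soup = bsoup(resp1, "html.parser")
html = soup.prettify('utf-8')
product_json = {}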
# TEST: scrape the image
# This block of code will help extract the image URL of the item
for divs in soup.findAll('div', attrs={'id': 'rwImages_hidden'}):
    for img_tag in divs.findAll('img', attrs={'style': 'display:none;'}):
        product_json['img-url'] = img_tag['src']
        break
# This block of code will help extract the brand of the item
for divs in soup.findAll('div', attrs={'class': 'a-box-group'}):
    try:
        product_json['brand'] = divs['data-brand']
        break
    except KeyError:
        # This div has no data-brand attribute; try the next one
        pass
# This block of code will help extract the product title of the item
for spans in soup.findAll('span', attrs={'id': 'productTitle'}):
    name_of_product = spans.text.strip()
    product_json['name'] = name_of_product
    break
# This block of code will help extract the price of the item in dollars
for divs in soup.findAll('div'):
    try:
        price = str(divs['data-asin-price'])
        product_json['price'] = '$' + price
        break
    except KeyError:
        # This div has no data-asin-price attribute; try the next one
        pass
# This block of code will help extract top specifications and details of the product
product_json['details'] = []
for ul_tags in soup.findAll('ul',
                            attrs={'class': 'a-unordered-list a-vertical a-spacing-none'}):
    for li_tags in ul_tags.findAll('li'):
        for spans in li_tags.findAll('span',
                                     attrs={'class': 'a-list-item'}, text=True,
                                     recursive=False):
            product_json['details'].append(spans.text.strip())
# This block of code will help extract the short reviews of the product
product_json['short-reviews'] = []
for a_tags in soup.findAll('a',
                           attrs={'class': 'a-size-base a-link-normal review-title a-color-base a-text-bold'}):
    short_review = a_tags.text.strip()
    product_json['short-reviews'].append(short_review)
print(product_json)
Let me save you hours of parsing HTML and dealing with Amazon UI updates.
import requests
import json

headers = {
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}
params = (
    ('srs', '18637575011'),
    ('ie', 'UTF8'),
    ('qid', '1564753495'),
    ('sr', '8-1'),
)
resp = requests.get('https://www.amazon.com/Apple-iPhone-GSM-Unlocked-16GB/dp/B00YD547Q6/ref=lp_18637575011_1_1',
                    headers=headers, params=params)
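# Locate the jQuery.parseJSON('...') call embedded in one of the script tags
index = resp.text.index('jQuery.parseJSON')
# Skip the 18 characters of "jQuery.parseJSON('" to reach the start of the JSON string
text = resp.text[index + 18:]
# The JSON sits on a single line; drop the trailing "');"
json_line = text.split('\n')[0][:-3]
jsn = json.loads(json_line)  # JSON object containing all the product data displayed on the page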
They return a JSON string, parsed with jQuery, in one of their script tags, and it contains all the data you might need.
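The fixed character offsets used to slice out the JSON are fragile if the surrounding markup shifts. As a minimal sketch of the same idea, a regular expression can capture whatever sits between jQuery.parseJSON(' and the closing '); on that line. The names SAMPLE_PAGE, PARSE_JSON_RE and extract_embedded_json are hypothetical, and the sample string is a made-up stand-in for resp.text, not real Amazon output. It also assumes the embedded string is valid JSON as-is.

```python
import json
import re

# Hypothetical stand-in for resp.text; the real page layout may differ.
SAMPLE_PAGE = """<script>
var obj = jQuery.parseJSON('{"title": "Sample phone", "price": "199.99"}');
</script>"""

# Capture everything between jQuery.parseJSON(' and the last '); on the same line.
PARSE_JSON_RE = re.compile(r"jQuery\.parseJSON\('(.*)'\);")


def extract_embedded_json(page_text):
    """Return the decoded embedded JSON, or None if it is absent or invalid."""
    match = PARSE_JSON_RE.search(page_text)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None


product = extract_embedded_json(SAMPLE_PAGE)
print(product)
```

Returning None instead of raising lets the caller fall back to HTML parsing when the script tag is missing or its content does not decode.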