Extracting more information from an Amazon product page with Python 3 and BeautifulSoup

I built a web scraper for Amazon product pages, based on code I found on the web. So far I can extract some information, but I need more. I have already extracted the ASIN and the SKU.

I still need: the image, the image URL, the product title, the price, the short description, and the brand.

How can I extend my code to obtain this information?

import requests
from bs4 import BeautifulSoup as bsoup

# Fetch and parse the product page (url is defined elsewhere)
resp1 = requests.get(url).content
soup = bsoup(resp1, "html.parser")
html = soup.prettify('utf-8')
product_json = {}

# TEST
# Extract the image URL of the item
for divs in soup.find_all('div', attrs={'id': 'rwImages_hidden'}):
    for img_tag in divs.find_all('img', attrs={'style': 'display:none;'}):
        product_json['img-url'] = img_tag['src']
        break

# Extract the brand of the item
for divs in soup.find_all('div', attrs={'class': 'a-box-group'}):
    try:
        product_json['brand'] = divs['data-brand']
        break
    except KeyError:
        # This div has no data-brand attribute; try the next one
        pass

# Extract the product title of the item
for spans in soup.find_all('span', attrs={'id': 'productTitle'}):
    name_of_product = spans.text.strip()
    product_json['name'] = name_of_product
    break

# Extract the price of the item in dollars
for divs in soup.find_all('div'):
    try:
        price = str(divs['data-asin-price'])
        product_json['price'] = '$' + price
        break
    except KeyError:
        # This div has no data-asin-price attribute; try the next one
        pass

# Extract the top specifications and details of the product
product_json['details'] = []
for ul_tags in soup.find_all('ul',
                             attrs={'class': 'a-unordered-list a-vertical a-spacing-none'}):
    for li_tags in ul_tags.find_all('li'):
        for spans in li_tags.find_all('span',
                                      attrs={'class': 'a-list-item'}, string=True,
                                      recursive=False):
            product_json['details'].append(spans.text.strip())

# Extract the short review titles of the product
product_json['short-reviews'] = []
for a_tags in soup.find_all('a',
                            attrs={'class': 'a-size-base a-link-normal review-title a-color-base a-text-bold'}):
    short_review = a_tags.text.strip()
    product_json['short-reviews'].append(short_review)
print(product_json)

Let me save you hours and hours of parsing HTML and dealing with Amazon UI updates.

import requests
import json

headers = {
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}

params = (
    ('srs', '18637575011'),
    ('ie', 'UTF8'),
    ('qid', '1564753495'),
    ('sr', '8-1'),
)

resp = requests.get('https://www.amazon.com/Apple-iPhone-GSM-Unlocked-16GB/dp/B00YD547Q6/ref=lp_18637575011_1_1',
                    headers=headers, params=params)

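# Find the jQuery.parseJSON call (raises ValueError if the page has none)
# and skip past it plus 2 more characters (presumably the opening "('"), 18 in total
index = resp.text.index('jQuery.parseJSON')
text = resp.text[index + 18:]

# The JSON string ends on the same line; drop its 3 trailing characters
json_line = text.split('\n')[0][:-3]
jsn = json.loads(json_line)  # JSON object containing all the product data displayed on the page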

One of the script tags in the response contains a jQuery.parseJSON call whose argument is a JSON string with all the data you might need.
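If the fixed character offsets turn out to be fragile, a regular expression is one possible alternative for pulling out the same JSON string. The sketch below is only an illustration: the helper name `extract_embedded_json` and the sample page fragment are made up, and it assumes the JSON is passed as a single-quoted literal that contains neither `')` nor escaped single quotes.

```python
import json
import re

# Matches the single-quoted argument of the first jQuery.parseJSON('...') call.
# Assumption: the argument does not itself contain the sequence "')".
_PARSE_JSON_RE = re.compile(r"jQuery\.parseJSON\('(.*?)'\)", re.DOTALL)


def extract_embedded_json(html):
    """Return the decoded jQuery.parseJSON argument, or None if not found."""
    match = _PARSE_JSON_RE.search(html)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        # The captured text was not valid JSON (for example, JS-style escapes)
        return None


# Hypothetical page fragment for illustration; a real page will differ.
sample_html = """<script>
var obj = jQuery.parseJSON('{"title": "Example phone", "price": "199.99"}');
</script>"""

data = extract_embedded_json(sample_html)
print(data)
```

With a live page you would pass `resp.text` instead of `sample_html` and check for `None` before reading any keys, since the page layout can change at any time.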
