
Error scraping Amazon Data with Python and BeautifulSoup

I just started with Python, and I'm seeing strange behaviour: most of the time my code ends in an error, and only occasionally does it run correctly.

import requests
from bs4 import BeautifulSoup

jblCharge4URL = 'https://www.amazon.de/JBL-Charge-Bluetooth-Lautsprecher-Schwarz-integrierter/dp/B07HGHRYCY/ref=sr_1_2_sspa?__mk_de_DE=%C3%85M%C3%85%C5%BD%C3%95%C3%91&keywords=jbl+charge+4&qid=1562775856&s=gateway&sr=8-2-spons&psc=1'

def get_page(url):
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    return soup

def get_product_name(url):
    soup = get_page(url)
    try:
        title = soup.find(id="productTitle").get_text()
        print("SUCCESS")
    except AttributeError:
        print("ERROR")
while True:
    print(get_product_name(jblCharge4URL))

Console output:

ERROR  
None  
ERROR  
None  
ERROR  
None  
ERROR  
None  
ERROR  
None  
ERROR  
None  
ERROR  
None  
ERROR  
None  
ERROR  
None  
ERROR  
None  
ERROR  
None  
ERROR  
None  
ERROR  
None  
ERROR  
None  
**SUCCESS**  
None  
ERROR  
None  
**SUCCESS**  
None  
ERROR  
None  
ERROR  
None  
ERROR  
None  
ERROR  
None  
ERROR  
None  
ERROR  
None

Thanks in advance.

I made a few adjustments to your code, and this should get you back on the right track:

import requests
from bs4 import BeautifulSoup

jblCharge4URL = 'https://www.amazon.de/JBL-Charge-Bluetooth-Lautsprecher-Schwarz-integrierter/dp/B07HGHRYCY/ref=sr_1_2_sspa?__mk_de_DE=%C3%85M%C3%85%C5%BD%C3%95%C3%91&keywords=jbl+charge+4&qid=1562775856&s=gateway&sr=8-2-spons&psc=1'

def get_page(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    return soup

def get_product_name(url):
    soup = get_page(url)
    # find() returns None when the element is missing; it does not raise
    # AttributeError, so test for None explicitly.
    title = soup.find(id="productTitle")
    if title is None:
        print("ERROR")
    else:
        print("SUCCESS")
    return title

print(get_product_name(jblCharge4URL))
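
Since the page only yields a title some of the time, the unbounded `while True` loop from the question could be swapped for a bounded retry with a pause between attempts. The following is only a rough sketch: `retry_until_value`, its parameters, and the default values are hypothetical names chosen for illustration, and it assumes the function being retried returns `None` on failure.

```python
import time

def retry_until_value(fetch, attempts=5, delay_seconds=2.0):
    """Call fetch() up to `attempts` times.

    Returns the first result that is not None, or None if every
    attempt fails. Sleeps `delay_seconds` between attempts so the
    server is not hit in a tight loop.
    """
    for attempt in range(1, attempts + 1):
        result = fetch()
        if result is not None:
            return result
        if attempt < attempts:
            time.sleep(delay_seconds)
    return None
```

It could be called as `retry_until_value(lambda: get_product_name(jblCharge4URL))`. Whether a pause of a few seconds is enough depends on the server, so the delay is a guess rather than a recommendation.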

What headers are you using in page = requests.get(url, headers=headers)? You would want something that makes the server believe you are a genuine user and not a script. I would recommend using something basic like:

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}
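
Putting the header to use might look roughly like the sketch below. It is not a definitive fix: the helper name `extract_title`, the `timeout` value, and the choice to return `None` on a non-200 status are assumptions made for illustration. Parsing is kept separate from fetching so it can be tried on any HTML string.

```python
import requests
from bs4 import BeautifulSoup

# User-Agent string copied from the suggestion above.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36"
    )
}

def extract_title(html):
    """Return the stripped text of the element with id="productTitle", or None."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find(id="productTitle")
    return tag.get_text(strip=True) if tag is not None else None

def get_product_name(url, timeout=10):
    """Fetch url with browser-like headers and return the product title, or None."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    if response.status_code != 200:
        # Assumption: treat any non-200 response as "no title available".
        return None
    return extract_title(response.text)
```

Calling `get_product_name(jblCharge4URL)` would then return either the title text or `None`, without relying on an `AttributeError` to signal a missing element.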

Also, you might want to print the value of the variable soup in your exception handler while you debug this issue. Printing soup gives you the HTML of the page that was actually returned, and you can then dig into that source to understand where the issue lies.
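
Printing the whole page can be noisy, so one option is to summarize what came back instead. The sketch below is a hypothetical helper (`summarize_page` and its field names are not from the original posts) that reports the page's `<title>`, whether `productTitle` is present, and a short preview of the HTML.

```python
from bs4 import BeautifulSoup

def summarize_page(html, preview_chars=300):
    """Return a small dict describing the HTML that was actually received."""
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.title
    return {
        # Text of the <title> element, if the page has one.
        "page_title": title_tag.get_text(strip=True) if title_tag is not None else None,
        # Whether the element the scraper is looking for exists at all.
        "has_product_title": soup.find(id="productTitle") is not None,
        "length": len(html),
        "preview": html[:preview_chars],
    }
```

Calling `print(summarize_page(page.text))` in the failing branch should make it easier to tell whether the server sent the product page or some other page.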

Apart from combining requests and BeautifulSoup, you could also use the requests-html package, which downloads the web page and parses its content in one step. An example of using requests-html would be:

from requests_html import HTMLSession

url = r"https://www.amazon.de/JBL-Charge-Bluetooth-Lautsprecher-Schwarz-integrierter/dp/B07HGHRYCY/"

req = HTMLSession().get(url)
product_title = req.html.find("#productTitle", first=True)
print(product_title.text)  #JBL Charge 4 Bluetooth-Lautsprecher in Schwarz – Wasserfeste, portable Boombox mit integrierter Powerbank – Mit nur einer Akku-Ladung bis zu 20 Stunden kabellos Musik streamen

Hope it helps.

