Can't scrape information from a static webpage using requests module

I'm trying to fetch a product title and its description from a webpage using the requests module. The title and description appear to be static, as both are present in the page source. However, the following attempt fails to grab them: the script throws an AttributeError.

import requests
from bs4 import BeautifulSoup

link = 'https://www.nordstrom.com/s/anine-bing-womens-plaid-shirt/6638030'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
}

with requests.Session() as s:
    s.headers.update(headers)
    res = s.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    product_title = soup.select_one("h1[itemProp='name']").text
    product_desc = soup.select_one("#product-page-selling-statement").text
    print(product_title,product_desc)

How can I scrape the title and description from the above page using the requests module?

The page is dynamic. Go after the data from the API source:

import requests
import pandas as pd

api = 'https://www.nordstrom.com/api/ng-looks/styleId/6638030?customerId=f36cf526cfe94a72bfb710e5e155f9ba&limit=7'
jsonData = requests.get(api).json()

df = pd.json_normalize(jsonData['products'].values())

print(df.iloc[0])

Output:

id                                                       6638030-400
name                                  ANINE BING Women's Plaid Shirt
styleId                                                      6638030
styleNumber                                                         
colorCode                                                        400
colorName                                                       BLUE
brandLabelName                                            ANINE BING
hasFlatShot                                                     True
imageUrl           https://n.nordstrommedia.com/id/sr3/6d000f40-8...
price                                                        $149.00
pathAlias          anine-bing-womens-plaid-shirt/6638030?origin=c...
originalPrice                                                $149.00
productTypeLvl1                                                   12
productTypeLvl2                                                  216
isUmap                                                         False
Name: 0, dtype: object
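If only the title is needed, pandas is optional. The sketch below assumes the payload shape implied by the code above: `jsonData['products']` is a dict whose values are product records carrying `styleId` and `name` keys, as the printed columns suggest. The helper name `product_names` and the sample payload are made up for illustration.

```python
def product_names(json_data, style_id):
    """Return the names of all products in the payload matching style_id.

    Assumes json_data["products"] maps some key to a product record
    (a dict with at least "styleId" and "name"), as the output above suggests.
    """
    wanted = str(style_id)
    return [
        product.get("name")
        for product in json_data.get("products", {}).values()
        if str(product.get("styleId")) == wanted
    ]


# Hand-made payload mirroring the fields shown in the output above.
sample = {
    "products": {
        "6638030-400": {
            "id": "6638030-400",
            "name": "ANINE BING Women's Plaid Shirt",
            "styleId": 6638030,
        }
    }
}
print(product_names(sample, 6638030))
```

With the real response, the call would be `product_names(jsonData, 6638030)`.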

When testing requests like these, you should output the response to see what you're getting back. It's best to use something like Postman (I think VS Code has a similar feature now) to set up URLs, headers, methods, and parameters, and to see the full response including its headers. When you have everything working right, just convert it to Python code. Postman even has some 'export to code' functions for common languages.
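Without Postman, the same inspection can be done from Python. This is a minimal sketch: `describe_response` is a hypothetical helper, and it only relies on the `status_code`, `headers` and `text` attributes that a `requests.Response` exposes. A stand-in object with invented values is used so the example runs offline.

```python
from types import SimpleNamespace


def describe_response(res, body_chars=300):
    """Summarise a response: status line, headers, then the start of the body.

    Works with any object exposing status_code, headers and text,
    which is the shape of a requests.Response.
    """
    lines = [f"status: {res.status_code}"]
    lines += [f"{name}: {value}" for name, value in res.headers.items()]
    lines.append("")  # blank line between headers and body
    lines.append(res.text[:body_chars])
    return "\n".join(lines)


# Stand-in for a real response so the sketch runs offline; the values are invented.
fake = SimpleNamespace(
    status_code=403,
    headers={"Content-Type": "text/html"},
    text="<html><body>Access denied</body></html>",
)
print(describe_response(fake))
```

In the question's script, calling `print(describe_response(res))` right after `res = s.get(link)` would show what the server actually returned before any parsing is attempted.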

Anyways...

I tried your request in Postman and got this response:

[image: response body]

[image: response headers]

Requests made from Python and from a browser are the same thing. If the headers, URLs, and parameters are identical, they should receive identical responses. So the next step is to compare your request with the one made by the browser:

[image: browser request]
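That comparison can also be done in code once the browser's request headers have been copied out of the developer tools. The sketch below is illustrative only: `diff_headers` is a hypothetical helper, and the header values in the example are invented, not taken from the actual Nordstrom request.

```python
def diff_headers(mine, browser):
    """Compare two header dicts case-insensitively.

    Returns (missing, different): header names the browser sent that the
    script did not, and names both sent but with different values.
    """
    mine_lc = {k.lower(): v for k, v in mine.items()}
    browser_lc = {k.lower(): v for k, v in browser.items()}
    missing = sorted(set(browser_lc) - set(mine_lc))
    different = sorted(
        k for k in set(mine_lc) & set(browser_lc) if mine_lc[k] != browser_lc[k]
    )
    return missing, different


# Invented example values; paste the real ones from the browser's network tab.
script_headers = {"User-Agent": "Mozilla/5.0", "Accept": "*/*"}
browser_headers = {
    "user-agent": "Mozilla/5.0",
    "accept": "text/html",
    "accept-language": "en-US,en;q=0.9",
}
missing, different = diff_headers(script_headers, browser_headers)
print("missing from script:", missing)
print("different values:", different)
```

Header names are compared in lowercase because HTTP header names are case-insensitive.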

So one or more of the headers sent by the browser gets a good response from the server, but User-Agent alone is not enough.

I would try to identify which headers, but unfortunately Nordstrom detected some 'unusual activity' and seems to have blocked my IP :(

[image: blocked]

Probably due to sending an obviously handmade request. I think it's my IP that's blocked, since I can't access the site from any browser, even after clearing my cache.
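For reference, one way to narrow down the headers is to start from the full browser set and drop one header at a time. This is a rough sketch under stated assumptions, not something that was run against Nordstrom: `find_required_headers` and `fake_works` are hypothetical names, the header values are invented, and the `works` callback stands in for whatever check counts as a good response. Repeated test requests may themselves get an IP blocked, hence the `delay` parameter.

```python
import time


def find_required_headers(headers, works, delay=0.0):
    """Drop one header at a time and record which removals break the request.

    headers: full set of headers known to give a good response.
    works:   callable taking a headers dict, returning True on a good response.
    delay:   seconds to sleep between attempts, to avoid hammering the server.
    """
    required = []
    for name in headers:
        trial = {k: v for k, v in headers.items() if k != name}
        if not works(trial):
            required.append(name)
        time.sleep(delay)
    return required


# Offline stand-in for the server: pretends two headers are mandatory.
def fake_works(h):
    return "User-Agent" in h and "Accept-Language" in h


# Invented header values, for illustration only.
browser_headers = {
    "User-Agent": "Mozilla/5.0",
    "Accept": "text/html",
    "Accept-Language": "en-US,en;q=0.9",
}
print(find_required_headers(browser_headers, fake_works))
```

A real `works` callback might wrap something like `requests.get(link, headers=h).ok`, or check that the expected `h1` is present in the returned HTML. Note that this approach only detects headers that are individually required; it would miss a case where any one of several headers is sufficient.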

So double-check that the same hasn't happened to you while working with your scraper.

Best of luck!
