Is there any other way to extract data from a dynamic website, rather than using Selenium?
I am trying to extract data for all the products (shirts, T-shirts and so on) from the website https://shop.nordstrom.com/. The page is loaded dynamically. I know I can use Selenium with a headless browser, but that is time-consuming, and locating elements that have strange ID and class names is not very promising either.
So I thought of looking in the browser's Network tool to find the path to the API from which the data is loaded (an XHR request), but I could not find anything helpful. Is there a way to get the data from the website?
If you don't want to use selenium, the alternative is to use a web parser like bs4 (Beautiful Soup), or simply to use the requests module.
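To make the parser route concrete, here is a minimal sketch. It uses Python's built-in html.parser instead of bs4 so that it runs without extra installs; with bs4 the rough equivalent would be BeautifulSoup(html, "html.parser").find_all("a"). The HTML snippet, the class name and the link paths are made up for illustration. Note that this only helps when the data is present in the HTML the server returns, which is often not the case for a dynamically loaded page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical static HTML; a real page would have to be downloaded first.
SAMPLE_HTML = """
<ul>
  <li><a class="product" href="/s/shirt-1">Oxford Shirt</a></li>
  <li><a class="product" href="/s/tee-2">Crewneck T-Shirt</a></li>
</ul>
"""

collector = LinkCollector()
collector.feed(SAMPLE_HTML)
print(collector.links)
```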
You are on the right path in looking for the call to the API. XHR requests can be seen under the Network tab, but the multitude of resources listed there makes it hard to pick out the requests being made. A simple way around this is the following method:
Instead of the Network tab, go to the Console tab. There, click on the settings icon, and then tick just the option Log XMLHttpRequests.
Now refresh the page and scroll down to trigger the dynamic calls. You will now be able to see the logs of all XHR requests much more clearly.
For example:
(index):29 Fetch finished loading: GET "https://shop.nordstrom.com/api/recs?page_type=home&placement=HP_SALE%2CHP_TOP_RECS%2CHP_CUST_HIS%2CHP_AFF_BRAND%2CHP_FTR&channel=web&bound=24%2C24%2C24%2C24%2C6&apikey=9df15975b8cb98f775942f3b0d614157&session_id=0&shopper_id=df0fdb2bb2cf4965a344452cb42ce560&country_code=US&experiment_id=945b2363-c75d-4950-b255-194803a3ee2a&category_id=2375500&style_id=0%2C0%2C0%2C0&ts=1593768329863&url=https%3A%2F%2Fshop.nordstrom.com%2F&zip_code=null".
Making a GET request to that URL returns a bunch of JSON objects. You can now use this URL, and others that you can derive from it, to make requests straight to the API.
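One way to derive other URLs from a logged one is to split it into the endpoint and its query parameters, change a parameter, and re-encode. The sketch below does this with the standard library. The URL is a shortened copy of the logged one (most parameters are left out for brevity), the helper names are made up for this example, and the replacement category_id is a hypothetical value; whether the API accepts a given combination of parameters has to be checked against what the browser actually sends.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

# Shortened version of the logged URL; the remaining parameters are omitted here.
LOGGED_URL = (
    "https://shop.nordstrom.com/api/recs"
    "?page_type=home&channel=web&country_code=US&category_id=2375500"
)


def split_api_url(url):
    """Return (endpoint, params), where params maps each query key to its value."""
    parts = urlsplit(url)
    endpoint = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    # parse_qs decodes escapes such as %2C and returns lists; keep the first value.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return endpoint, params


def build_api_url(endpoint, params):
    """Re-encode params onto the endpoint to form a new request URL."""
    return endpoint + "?" + urlencode(params)


endpoint, params = split_api_url(LOGGED_URL)
params["category_id"] = "1234567"  # hypothetical value, not a known category
print(build_api_url(endpoint, params))
```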
See the answer here on how you can integrate the URL with the requests module to fetch data.
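As a rough sketch of that last step, the function below sends a GET request to an endpoint and decodes the JSON reply. It uses urllib from the standard library so that it runs without extra packages. The function name fetch_json, the opener parameter (there only so the network call can be swapped out) and the header values are assumptions made for this example, and the site may require other headers or cookies that the browser sends.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def fetch_json(endpoint, params, opener=urlopen, timeout=10):
    """Send GET endpoint?params and return the response body decoded as JSON."""
    url = endpoint + "?" + urlencode(params)
    # Assumption: a browser-like User-Agent is sent because some sites
    # reject requests that do not carry one.
    headers = {"User-Agent": "Mozilla/5.0", "Accept": "application/json"}
    request = Request(url, headers=headers)
    with opener(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call (performs a real network request, so it is left commented out):
# data = fetch_json(
#     "https://shop.nordstrom.com/api/recs",
#     {"page_type": "home", "channel": "web", "country_code": "US"},
# )
```

With the requests package installed, the body of fetch_json reduces to requests.get(endpoint, params=params, headers=headers, timeout=10).json().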