
How to get all pages from webscraping

I am trying to get a list of all shoes from every page of this website, https://www.dickssportinggoods.com/f/all-mens-footwear, but I don't know what else to write in my code. Basically, I would like to select a brand of shoes across all the pages of the website. For example, I would like to select New Balance shoes and print a list of all shoes of the brand I selected. Here is my code:

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

Url = 'https://www.dickssportinggoods.com/f/all-mens-footwear'
uClient = uReq(Url)
Page = uClient.read()
uClient.close()

page_soup = soup(Page, "html.parser")
# Print the label of every facet container (the brand filter names)
for i in page_soup.find_all("div", {"class": "rs-facet-name-container"}):
    print(i.text)

You can click on the filter button and check all the brands you want. You just have to use driver.find_element_by_xpath(). If you use Selenium, you must know this.
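
A rough sketch of that idea is below. The XPaths are hypothetical; inspect the page to find the real locators for the filter button and the brand checkbox:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.dickssportinggoods.com/f/all-mens-footwear')

# Hypothetical locators: replace them with the actual ones from inspect element
filter_button = driver.find_element_by_xpath('//button[contains(., "Filter")]')
filter_button.click()
brand_checkbox = driver.find_element_by_xpath('//span[text()="New Balance"]')
brand_checkbox.click()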

That site updates its elements with a JS script, so you won't be able to use BeautifulSoup alone; you have to use browser automation.

The code below won't work because the elements get updated after a few milliseconds. The page initially shows all the brands, then updates to show only the chosen brand, so use automation.

Code that fails:

from bs4 import BeautifulSoup as soup
import time
from urllib.request import urlopen as uReq

url_st = 'https://www.dickssportinggoods.com/f/mens-athletic-shoes?pageNumber=0&filterFacets=X_BRAND'
brands_name = ['New Balance']

# Build the filterFacets parameter: %3A encodes ':', %2C encodes ',', %20 encodes ' '
for idx, br in enumerate(brands_name):
    if idx == 0:
        url_st += '%3A' + '%20'.join(br.split(' '))
    else:
        url_st += '%2C' + '%20'.join(br.split(' '))

uClient = uReq(url_st)
time.sleep(4)  # sleeping doesn't help: urlopen has already returned the initial HTML
Page = uClient.read()
uClient.close()

page_soup = soup(Page, "html.parser")
for match in page_soup.find_all('div', class_='rs_product_description d-block'):
    print(match.text)

Code (Selenium + bs4):

from bs4 import BeautifulSoup as soup
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.options import Options
import time
from webdriver_manager.chrome import ChromeDriverManager

chrome_options = Options()
chrome_options.add_argument("--headless")
# Headless mode is left disabled; see the note below about the pop-up dialog
driver = webdriver.Chrome(ChromeDriverManager().install())  # , chrome_options=chrome_options)
driver.maximize_window()

brands_name = ['New Balance']

# Build the filterFacets parameter: %3A encodes ':', %2C encodes ',', %20 encodes ' '
filter_facet = 'filterFacets=X_BRAND'
for idx, br in enumerate(brands_name):
    if idx == 0:
        filter_facet += '%3A' + '%20'.join(br.split(' '))
    else:
        filter_facet += '%2C' + '%20'.join(br.split(' '))

url = f"https://www.dickssportinggoods.com/f/mens-athletic-shoes?pageNumber=0&{filter_facet}"
driver.get(url)
time.sleep(4)

# Dismiss the pop-up dialog if it appears
try:
    driver.find_element_by_class_name('close').click()
except NoSuchElementException:
    pass

page_soup = soup(driver.page_source, 'html.parser')
for match in page_soup.find_all('div', class_='rs_product_description d-block'):
    print(match.text)

# Collect the page numbers from the pagination links, then visit each remaining page
page_num = page_soup.find_all('a', class_='rs-page-item')
pnum = [int(pn.text) for pn in page_num if pn.text != '']
if len(pnum) >= 2:
    for pn in range(1, max(pnum)):
        url = f"https://www.dickssportinggoods.com/f/mens-athletic-shoes?pageNumber={pn}&{filter_facet}"
        driver.get(url)
        time.sleep(2)
        page_soup = soup(driver.page_source, "html.parser")
        for match in page_soup.find_all('div', class_='rs_product_description d-block'):
            print(match.text)

Output:

New Balance Men's 410v6 Trail Running Shoes
New Balance Men's 623v3 Training Shoes
.
.
.
New Balance Men's Fresh Foam Beacon Running Shoes
New Balance Men's Fresh Foam Cruz v2 SockFit Running Shoes
New Balance Men's 470 Running Shoes
New Balance Men's 996v3 Tennis Shoes
New Balance Men's 1260 V7 Running Shoes
New Balance Men's Fresh Foam Beacon Running Shoes

I have commented out headless Chrome because when you open the browser you get a dialog, and only after closing it can you fetch the product details. In headless automation you won't be able to do that (I can't explain why; I'm not that strong on Selenium concepts).

Don't forget to install webdriver_manager: pip install webdriver_manager

The page creates the links you want using JavaScript, so you cannot scrape them directly; you need to replicate the page's requests. In this case the page sends a POST request:

Request URL: https://prod-catalog-product-api.dickssportinggoods.com/v1/search
Request Method: POST
Status Code: 200 OK
Remote Address: [2600:1400:d:696::25db]:443
Referrer Policy: no-referrer-when-downgrade

Check the request headers with the browser's inspect-element tools so you can emulate the POST request.

This is the URL the POST request is sent to:

https://prod-catalog-product-api.dickssportinggoods.com/v1/search

and this is the POST body the browser sends:

{
  "isFamilyPage": true,
  "pageNumber": 0,
  "pageSize": 48,
  "searchTypes": [],
  "selectedCategory": "12301_1714863",
  "selectedFilters": {"X_BRAND": ["New Balance"]},   # <-- this is the info that you want
  "selectedSort": 1,
  "selectedStore": "1406",
  "storeId": 15108,
  "totalCount": 3360
}

The page might also require headers, so make sure to emulate the requests sent by the browser.
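
A minimal sketch of emulating that POST with the requests library follows. The payload fields are copied from the browser's network tab above; the headers here are an assumption, so copy the real ones from the browser if the API rejects the request:

import requests

url = 'https://prod-catalog-product-api.dickssportinggoods.com/v1/search'

# Payload copied from the browser's network tab; adjust pageNumber to paginate
payload = {
    "isFamilyPage": True,
    "pageNumber": 0,
    "pageSize": 48,
    "searchTypes": [],
    "selectedCategory": "12301_1714863",
    "selectedFilters": {"X_BRAND": ["New Balance"]},
    "selectedSort": 1,
    "selectedStore": "1406",
    "storeId": 15108,
}

# Assumed headers; mirror whatever the browser actually sends
headers = {
    "Content-Type": "application/json",
    "User-Agent": "Mozilla/5.0",
}

response = requests.post(url, json=payload, headers=headers)
print(response.json())

Getting all pages is then just a matter of incrementing pageNumber in the payload until no more products come back.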
