
Scrape data from a website that URL doesn't change

I'm new to web scraping, but I have enough command of requests, BeautifulSoup, and Selenium to extract data from a website. Now the problem is, I'm trying to scrape data from a website whose URL doesn't change when I click a page number to go to the next page.

(Screenshot: the page-number element in the browser inspector)

websiteURL ==> https://www.ellsworth.com/products/adhesives/

I also tried the Google Developer Tools but couldn't figure out the way. If someone could guide me with code, I would be grateful. (Screenshot: Developer Tools showing the GET request)

Here is my code:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import pandas as pd
import requests
itemproducts = pd.DataFrame()
driver = webdriver.Chrome(ChromeDriverManager().install())

driver.get('https://www.ellsworth.com/products/adhesives/')
base_url = 'https://www.ellsworth.com'
html = driver.page_source
s = BeautifulSoup(html, 'html.parser')
data = []

href_link = s.find_all('div',{'class':'results-products-item-image'})
for links in href_link:
    href_link_a = links.find('a')['href']
    data.append(base_url+href_link_a)
# url = 'https://www.ellsworth.com/products/adhesives/silicone/dow-838-silicone-adhesive-sealant-white-90-ml-tube/'

for link in data:  # renamed from 'c', which was shadowed inside the loop
    driver.get(link)
    html_pro = driver.page_source
    soup = BeautifulSoup(html_pro, 'html.parser')
    title = soup.find('span', {'itemprop': 'name'}).text.strip()
    part_num = soup.find('span', {'itemprop': 'sku'}).text.strip()
    manufacturer = soup.find('span', {'class': 'manuSku'}).text.strip()
    manufacturer_sku = manufacturer.replace('Manufacturer SKU:', '').strip()
    pro_det = soup.find('div', {'class': 'product-details'})
    p = pro_det.find_all('p')
    # defaults, so the row is still written when a description is missing
    desc_first = desc_second = ''
    try:
        desc_first = p[0].text.strip()   # was 'c = p.text' (a ResultSet has no .text)
        desc_second = p[1].text.strip()
    except IndexError:
        pass
    table = pro_det.find('table', {'class': 'table'})
    td = table.find_all('td')
    typical = td[1].text.strip()
    brand = td[3].text.strip()
    color = td[5].text.strip()
    image = soup.find('img', {'itemprop': 'image'})['src']
    image_url = base_url + image
    image_filename = title + '.jpg'
    img_data = requests.get(image_url).content
    with open(image_filename, 'wb') as fh:
        fh.write(img_data)

    # note: DataFrame.append was removed in pandas 2.0; collect dicts in a
    # list and build the DataFrame once if you are on a newer version
    itemproducts = itemproducts.append({'Product Title': title,
                                        'Part Number': part_num,
                                        'SKU': manufacturer_sku,
                                        'Description 1': desc_first,
                                        'Description 2': desc_second,
                                        'Typical': typical,
                                        'Brand': brand,
                                        'Color': color,
                                        'Image URL': image_url}, ignore_index=True)

The content of the page is rendered dynamically, but if you inspect the XHR tab under Network in the Developer Tools you can find the API request URL. I've shortened the URL a bit, and it still works just fine.

Here's how you can get the list of the first 10 products from page 1:

import requests

start = 0
n_items = 10

api_request_url = f"https://www.ellsworth.com/api/catalogSearch/search?sEcho=1&iDisplayStart={start}&iDisplayLength={n_items}&DefaultCatalogNode=Adhesives&_=1497895052601"

data = requests.get(api_request_url).json()

print(f"Found: {data['iTotalRecords']} items.")

for item in data["aaData"]:
    print(item)

This gets you a nice JSON response with all the data for each product, which should get you started. A sample item looks like this:

['Sauereisen Insa-Lute Adhesive Cement No. P-1 Powder Off-White 1 qt Can', 'P-1-INSA-LUTE-ADHESIVE', 'P-1 INSA-LUTE ADHESIVE', '$72.82', '/products/adhesives/ceramic/sauereisen-insa-lute-adhesive-cement-no.-p-1-powder-off-white-1-qt-can/', '/globalassets/catalogs/sauereisen-insa-lute-cement-no-p-1-off-white-1qt_170x170.jpg', 'Adhesives-Ceramic', '[{"qty":"1-2","price":"$72.82","customerPrice":"$72.82","eachPrice":"","custEachPrice":"","priceAmount":"72.820000000","customerPriceAmount":"72.820000000","currency":"USD"},{"qty":"3-15","price":"$67.62","customerPrice":"$67.62","eachPrice":"","custEachPrice":"","priceAmount":"67.620000000","customerPriceAmount":"67.620000000","currency":"USD"},{"qty":"16+","price":"$63.36","customerPrice":"$63.36","eachPrice":"","custEachPrice":"","priceAmount":"63.360000000","customerPriceAmount":"63.360000000","currency":"USD"}]', '', '', '', 'P1-Q', '1000', 'true', 'Presentation of packaged goods may vary. For special packaging requirements, please call (877) 454-9224', '', '', '']
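Each item in `aaData` is a flat list, so it helps to map the positions to names. A minimal sketch: the positions and field names below are inferred from the sample row above (they are my labels, not part of the API), so verify them against a live response.

```python
def parse_item(row):
    """Map one aaData row (a flat list) to named fields.

    Positions are inferred from the sample row shown above;
    the dictionary keys are made up for readability.
    """
    return {
        "title": row[0],
        "sku": row[1],
        "manufacturer_sku": row[2],
        "price": row[3],
        "product_path": row[4],   # relative, prepend https://www.ellsworth.com
        "image_path": row[5],
        "category": row[6],
    }

sample = ['Sauereisen Insa-Lute Adhesive Cement No. P-1 Powder Off-White 1 qt Can',
          'P-1-INSA-LUTE-ADHESIVE', 'P-1 INSA-LUTE ADHESIVE', '$72.82',
          '/products/adhesives/ceramic/sauereisen-insa-lute-adhesive-cement-'
          'no.-p-1-powder-off-white-1-qt-can/',
          '/globalassets/catalogs/sauereisen-insa-lute-cement-no-p-1-off-white'
          '-1qt_170x170.jpg',
          'Adhesives-Ceramic']

item = parse_item(sample)
print(item["price"])  # $72.82
```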

If you want to get the next 10 items, change the iDisplayStart value to 10. And if you want more items per request, just change iDisplayLength to, say, 20.

In the demo I substitute these values with start and n_items, but you can easily automate that, because the total number of items comes back in the response as iTotalRecords.
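Putting that together, a pagination loop might look like the sketch below. It assumes the response keeps the `iTotalRecords` / `aaData` shape shown above; `page_starts` and `fetch_all` are my own helper names. The network call is kept under the `__main__` guard so the helpers can be tested offline.

```python
import requests

# shortened API URL from the answer above, with the two paging
# parameters left as format placeholders
BASE = ("https://www.ellsworth.com/api/catalogSearch/search"
        "?sEcho=1&iDisplayStart={start}&iDisplayLength={length}"
        "&DefaultCatalogNode=Adhesives")

def page_starts(total_records, page_size):
    """iDisplayStart values needed to cover every record."""
    return list(range(0, total_records, page_size))

def fetch_all(page_size=50):
    """Fetch every page and return the combined aaData rows."""
    first = requests.get(BASE.format(start=0, length=page_size)).json()
    items = list(first["aaData"])
    # first request already covered start=0, so skip it
    for start in page_starts(first["iTotalRecords"], page_size)[1:]:
        page = requests.get(BASE.format(start=start, length=page_size)).json()
        items.extend(page["aaData"])
    return items

if __name__ == "__main__":
    print(f"Fetched {len(fetch_all())} items.")
```

For example, 243 records at 50 per request need requests starting at 0, 50, 100, 150 and 200.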
