How to scrape all the pages in the website

https://www.bestbuy.com/site/promo/health-fitness-deals


I want to loop through these 10 pages and scrape their names and hrefs. Below is my code, which only scrapes the 1st page ten times:

def name():
    for i in range(1, 11):
        # this xpath always matches elements on the currently loaded page,
        # and nothing in the loop ever navigates to the next page
        tag = driver.find_elements_by_xpath('/html/body/div[4]/main/div[9]/div/div/div/div/div/div/div[2]/div[2]/div[3]/div/div[5]/ol/li[3]/div/div/div/div/div/div[2]/div[1]/div[2]/div/h4')
        for a in tag:
            for name in a.find_elements_by_tag_name('a'):
                links = name.get_attribute("href")
                names = name.get_attribute('text')
                watches_name.append(names)
                watches_link.append(links)
                # print(watches_name)
                # print(watches_link)


name()

If you want to get elements from the next pages then you have to click() on the > link:

driver.find_element_by_css_selector('.sku-list-page-next').click()

Minimal working code with other changes.

I reduced the xpath to something much simpler. And I keep name and link as a pair, because that makes it simpler to write to a CSV file or a database, or to filter and sort (see the CSV sketch after the code below).

I had to use a longer sleep - sometimes my browser needs more time to update the elements on the page.

from selenium import webdriver
import time

url = 'https://www.bestbuy.com/site/promo/health-fitness-deals'

driver = webdriver.Firefox()
driver.get(url)

time.sleep(2)

# page "Hello! Choose a Country" - selecting Unitet State flag
driver.find_element_by_class_name('us-link').click()

items = []

for page in range(1, 11):

    print('\n[DEBUG] wait 15 seconds to update page\n')
    time.sleep(15)

    print('\n--- page', page, '---\n')

    all_links = driver.find_elements_by_css_selector('#main-results h4 a')
    for a in all_links:
        link = a.get_attribute("href")
        name = a.get_attribute('text')
        items.append( [name, link] )
        print(name)

    print('\n[DEBUG] click next\n')
    driver.find_element_by_css_selector('.sku-list-page-next').click()
    
#print(items)
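
Since items keeps [name, link] pairs, writing them out afterwards is straightforward. A minimal sketch using the standard csv module (the filename items.csv is just an example):

import csv

# write the collected [name, link] pairs to a CSV file
with open('items.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'link'])   # header row
    writer.writerows(items)             # one row per [name, link] pair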

BTW:

This method could be done with while True and some method to recognize whether the > link exists - and exit the loop when there is no > (see the EDIT below). This way it could work with any number of pages.


Other method.

When you manually visit a few pages you should see that the second page has a url with ?cp=2, the third with ?cp=3, etc., so you could use that to load the pages:

driver.get(url + '?cp=' + str(page+1) )

Minimal working code.

from selenium import webdriver
import time

url = 'https://www.bestbuy.com/site/promo/health-fitness-deals'

driver = webdriver.Firefox()
driver.get(url)

time.sleep(2)

# page "Hello! Choose a Country" - selecting Unitet State flag
driver.find_element_by_class_name('us-link').click()

items = []

for page in range(1, 11):

    print('\n[DEBUG] wait 15 seconds to update page\n')
    time.sleep(15)

    print('\n--- page', page, '---\n')

    all_links = driver.find_elements_by_css_selector('#main-results h4 a')
    for a in all_links:
        link = a.get_attribute("href")
        name = a.get_attribute('text')
        items.append( [name, link] )
        print(name)

    print('\n[DEBUG] load next url\n')
    driver.get(url + '?cp=' + str(page+1) )
    
#print(items)
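
BTW: instead of a fixed time.sleep(15), Selenium's explicit waits block only until the elements actually show up. A minimal sketch of replacing the sleep, assuming the same #main-results h4 a selector:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 15 seconds for the product links to appear,
# then return them - no fixed sleep needed
all_links = WebDriverWait(driver, 15).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, '#main-results h4 a'))
)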

This method could also use while True and a page variable to get any number of pages.


EDIT:

Versions with while True:

from selenium import webdriver
import time

url = 'https://www.bestbuy.com/site/promo/health-fitness-deals'

driver = webdriver.Firefox()
driver.get(url)

time.sleep(2)

# page "Hello! Choose a Country" - selecting Unitet State flag
driver.find_element_by_class_name('us-link').click()

items = []

page = 1

while True:

    print('\n[DEBUG] wait 15 seconds to update page\n')
    time.sleep(15)

    print('\n--- page', page, '---\n')

    all_links = driver.find_elements_by_css_selector('#main-results h4 a')
    for a in all_links:
        link = a.get_attribute("href")
        name = a.get_attribute('text')
        items.append( [name, link] )
        print(name)

    page += 1

    print('\n[DEBUG] load next url\n')
    driver.get(url + '?cp=' + str(page) )

    if driver.title == 'Best Buy: Page Not Found':
        print('\n[DEBUG] exit loop\n')
        break
    
#print(items)

and

from selenium import webdriver
import time

url = 'https://www.bestbuy.com/site/promo/health-fitness-deals'

driver = webdriver.Firefox()
driver.get(url)

time.sleep(2)

# page "Hello! Choose a Country" - selecting Unitet State flag
driver.find_element_by_class_name('us-link').click()

items = []

page = 1

while True:

    print('\n[DEBUG] wait 15 seconds to update page\n')
    time.sleep(15)

    print('\n--- page', page, '---\n')

    all_links = driver.find_elements_by_css_selector('#main-results h4 a')
    for a in all_links:
        link = a.get_attribute("href")
        name = a.get_attribute('text')
        items.append( [name, link] )
        print(name)

    page += 1
    
    print('\n[DEBUG] click next\n')
    item = driver.find_element_by_css_selector('.sku-list-page-next')
    if item.get_attribute("href"):
        item.click()
    else:
        print('\n[DEBUG] exit loop\n')
        break        
    
#print(items)

I guess if your code is working right, you will just need to click the pagination button. I found it can be located with the help of the css selector ('#Caret_Right_Line_Sm'). Try adding this line to your function:

def name():
    for i in range(1, 11):
        tag = driver.find_elements_by_xpath('/html/body/div[4]/main/div[9]/div/div/div/div/div/div/div[2]/div[2]/div[3]/div/div[5]/ol/li[3]/div/div/div/div/div/div[2]/div[1]/div[2]/div/h4')
        for a in tag:
            for name in a.find_elements_by_tag_name('a'):
                links = name.get_attribute("href")
                names = name.get_attribute('text')
                watches_name.append(names)
                watches_link.append(links)
                # print(watches_name)
                # print(watches_link)
        # click the second caret icon (the "next page" arrow) after scraping the page
        driver.find_elements_by_css_selector('#Caret_Right_Line_Sm')[1].click()

name()
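
Note: the find_element_by_* / find_elements_by_* helpers used in the snippets above were deprecated and later removed in Selenium 4, so on a current Selenium the same lookups have to be written with the By class, roughly:

from selenium.webdriver.common.by import By

# Selenium 4 equivalents of the old helper methods
driver.find_element(By.CLASS_NAME, 'us-link').click()
all_links = driver.find_elements(By.CSS_SELECTOR, '#main-results h4 a')
driver.find_element(By.CSS_SELECTOR, '.sku-list-page-next').click()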
