
How to get all the products from all pages in the subcategory (python, amazon)

How can I get all the products from all the pages in the subcategory? I attached the program. Right now my program only gets products from the first page. I would like to get all the products in that subcategory from all 400+ pages: go to the next page, extract all the products, then move on to the next page, and so on. I will appreciate any help.

# selenium imports
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import random

PROXY = "88.157.149.250:8080"

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=%s' % PROXY)

# //a[starts-with(@href, 'https://www.amazon.com/')]/@href
LINKS_XPATH = '//*[contains(@id,"result")]/div/div[3]/div[1]/a'

browser = webdriver.Chrome(chrome_options=chrome_options)
browser.get('https://www.amazon.com/s/ref=lp_11444071011_nr_p_8_1/132-3636705-4291947?rh=n%3A3375251%2Cn%3A%213375301%2Cn%3A10971181011%2Cn%3A11444071011%2Cp_8%3A2229059011')

links = browser.find_elements_by_xpath(LINKS_XPATH)
for link in links:
    href = link.get_attribute('href')
    print(href)

As you want to get a huge amount of data, it's better to fetch it with direct HTTP requests instead of navigating to each page with Selenium...

Try to iterate through all the pages and scrape the required data as below:

import requests
from lxml import html

page_counter = 1
links = []

while True:
    headers = {"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:58.0) Gecko/20100101 Firefox/58.0"}
    url = "https://www.amazon.com/s/ref=sr_pg_{0}?rh=n%3A3375251%2Cn%3A!3375301%2Cn%3A10971181011%2Cn%3A11444071011%2Cp_8%3A2229059011&page={0}&ie=UTF8&qid=1517398836".format(page_counter)
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        # collect the product links from the current result page
        source = html.fromstring(response.content)
        links.extend(source.xpath('//*[contains(@id,"result")]/div/div[3]/div[1]/a/@href'))
        page_counter += 1
    else:
        # stop once a page request fails
        break

print(links)

P.S. Check this ticket to be able to use a proxy with the requests library.
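
For reference, a minimal sketch of passing a proxy to requests via its proxies argument; the address below is just the placeholder proxy from the question, not a known working one:

import requests

# placeholder proxy address taken from the question; replace with a working proxy
proxies = {
    "http": "http://88.157.149.250:8080",
    "https": "http://88.157.149.250:8080",
}
response = requests.get("https://www.amazon.com/", proxies=proxies, timeout=10)
print(response.status_code)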

# selenium imports
import random
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
import time


def list_all_items():
    # items = browser.find_elements_by_css_selector('.a-size-base.s-inline.s-access-title.a-text-normal')
    print("Start")
    item_list = []
    items = WebDriverWait(browser, 60).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".a-size-base.s-inline.s-access-title.a-text-normal")))
    print("items--->", items)
    if items:
        for item in items:
            print(item.text, "\n\n")
            item_list.append(item.text)
    #time.sleep(3)
    #next_button = WebDriverWait(browser, 60).until(EC.presence_of_element_located((By.ID, 'pagnNextString')))
    next_button = WebDriverWait(browser, 60).until(EC.element_to_be_clickable((By.ID, "pagnNextString")))
    print("next_button-->", next_button)
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    print("____________SCROLL_DONE___")
    next_button.click()
    print("Click_done")
    list_all_items()
#     next_button = browser.find_element_by_id('pagnNextString')
#     next_button.click()

# ifpagnNextString
# https://www.amazon.com/s/ref=lp_11444071011_nr_p_8_1/132-3636705-4291947?rh=n%3A3375251%2Cn%3A%213375301%2Cn%3A10971181011%2Cn%3A11444071011%2Cp_8%3A2229059011


PROXY = "88.157.149.250:8080"

chrome_options = webdriver.ChromeOptions()
# chrome_options.add_argument('--proxy-server=%s' % PROXY)
# //a[starts-with(@href, 'https://www.amazon.com/')]/@href
LINKS_XPATH = '//*[contains(@id,"result")]/div/div[3]/div[1]/a'
browser = webdriver.Chrome(chrome_options=chrome_options)
browser.maximize_window()
browser.get('https://www.amazon.com/s/ref=lp_11444071011_nr_p_8_1/132-3636705-4291947?rh=n%3A3375251%2Cn%3A%213375301%2Cn%3A10971181011%2Cn%3A11444071011%2Cp_8%3A2229059011')

list_all_items()

I have made one method that will print the list of items from a page and call itself recursively, and at the end of the method I click on the next button. I did not add the break and exit condition; I believe you can manage it. The list_all_items method is the logic for doing what you require.

Also, uncomment the proxy part that I have commented out.
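
If you want a concrete exit condition, one possible sketch (reusing browser, WebDriverWait, EC, and By from the script above) is to let the wait for the next-page button time out on the last page, catch selenium's TimeoutException, and stop the recursion there:

from selenium.common.exceptions import TimeoutException

def list_all_items():
    # ... collect the items on the current page as in the script above ...
    try:
        next_button = WebDriverWait(browser, 10).until(
            EC.element_to_be_clickable((By.ID, "pagnNextString")))
    except TimeoutException:
        # no clickable "Next" button appeared within 10s:
        # assume this was the last page and stop recursing
        return
    next_button.click()
    list_all_items()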

Let me break this problem up into a few steps, so you understand what needs to be done here.

First of all, you need to get all the products from a page.

Then, you need to get all the pages and repeat the first step on each and every page.

Now, I do not know Python, so I will try to do this in as generic a way as I can.

First, you need to create an int with value 0. After that, you need to get the number of pages. To do so, check:

numberOfPagesString = browser.find_element_by_xpath("//span[@class='pagnDisabled']").text

numberOfPages = int(numberOfPagesString)

i = 0

Then you need to create a loop. In the loop, you are going to increment the int that you set to 0, up to a maximum of 400.

So now your loop, as long as the int is NOT equal to 400, is going to click on the next page, get all the products, and do what you want it to do. This will result in something like:

while i < numberOfPages:  # as long as the value of i is less than 400, do this loop

    # code to get all products on the page goes here

    # click on the next page link
    browser.find_element_by_id('pagnNextString').click()

    i += 1  # i becomes 1 after the first page, 2 after the second, etc.

So to conclude, the first thing you are doing is determining how many pages there are.

Then you are going to create an int from that string you get back from the browser.

Then you create an int with value 0, which you are going to use on every iteration of the loop to check whether you have reached the maximum number of pages.

After that you are going to first get all the products from the page (if you do not do that, it is going to skip the first page).

And at last, it's going to click on the next page button.

To finish it, your int i gets incremented with i += 1, so after every loop it increases by 1.
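
Since the steps above are deliberately generic, here is a hedged Python sketch of the same approach. It reuses the browser setup, subcategory URL, and product-link XPath from the question, plus the pagnDisabled and pagnNextString locators from this answer; these selectors are assumptions and may need adjusting to the live page:

from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.amazon.com/s/ref=lp_11444071011_nr_p_8_1/132-3636705-4291947?rh=n%3A3375251%2Cn%3A%213375301%2Cn%3A10971181011%2Cn%3A11444071011%2Cp_8%3A2229059011')

# read the total page count from the disabled "last page" element
numberOfPagesString = browser.find_element_by_xpath("//span[@class='pagnDisabled']").text
numberOfPages = int(numberOfPagesString)

i = 0
links = []
while i < numberOfPages:
    # collect all product links on the current page first,
    # so the first page is not skipped
    for link in browser.find_elements_by_xpath('//*[contains(@id,"result")]/div/div[3]/div[1]/a'):
        links.append(link.get_attribute('href'))
    # click the next page link, except on the last page where it no longer exists
    if i < numberOfPages - 1:
        browser.find_element_by_id('pagnNextString').click()
    i += 1

print(links)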
