
How to scrape multiple webpages stemming from one page using selenium?

Recently I have been attempting to scrape a large amount of pricing data from a website, starting from one page that links to each item's individual page. I was hoping to run a script that clicks the box for a given item, scrapes that item's pricing and description, and then goes back to the starting page and continues in that loop. However, I ran into an obvious problem after scraping the first item: once the script navigates back to the starting page, the previously located containers are no longer attached to the page, so a stale element error is thrown, which breaks the loop and prevents me from getting the rest of the items. This is the sample code I used, hoping to scrape all the items one after another.

import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains

driver = webdriver.Chrome(r'C:\Users\Hank\Desktop\chromedriver_win32\chromedriver.exe')
driver.get('https://steamcommunity.com/market/search?q=&category_440_Collection%5B%5D=any&category_440_Type%5B%5D=tag_misc&category_440_Quality%5B%5D=tag_rarity4&appid=440#p1_price_asc')

time.sleep(5)

next_button = wait(driver, 10).until(EC.element_to_be_clickable((By.ID, 'searchResults_btn_next')))

def prices_and_effects():
    # Hover over each item image so its description tooltip renders, then print it
    imgs = wait(driver, 5).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, 'img.market_listing_item_img.economy_item_hoverable')))
    for img in imgs:
        ActionChains(driver).move_to_element(img).perform()
        print([my_element.text for my_element in wait(driver, 10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.item_desc_description div.item_desc_descriptors#hover_item_descriptors div.descriptor")))])
    prices = driver.find_elements(By.CSS_SELECTOR, 'span.market_listing_price.market_listing_price_with_fee')
    for price in prices:
        print(price.text)

def unusuals():
    unusuals = wait(driver, 5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.market_listing_row.market_recent_listing_row.market_listing_searchresult')))
    for unusual in unusuals:
        unusual.click()
        time.sleep(2)
        next_button=wait(driver, 10).until(EC.element_to_be_clickable((By.ID,'searchResults_btn_next')))
        next_button.click()
        time.sleep(2)
        back_button=wait(driver, 10).until(EC.element_to_be_clickable((By.ID,'searchResults_btn_prev')))
        back_button.click()
        time.sleep(2)
        prices_and_effects()
        ref_val = wait(driver, 10).until(EC.presence_of_element_located((By.ID, 'searchResults_start'))).text
        while next_button.get_attribute('class') == 'pagebtn':
            next_button.click()
            wait(driver, 10).until(lambda driver: wait(driver, 10).until(EC.presence_of_element_located((By.ID,'searchResults_start'))).text != ref_val)
            prices_and_effects()
            ref_val = wait(driver, 10).until(EC.presence_of_element_located((By.ID, 'searchResults_start'))).text
        time.sleep(2)
        driver.execute_script("window.history.go(-1)")
        time.sleep(2)
        # Note: re-assigning `unusuals` here does not rebind the list the for-loop
        # iterates over, so the next `unusual.click()` uses a stale reference
        unusuals = wait(driver, 5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.market_listing_row.market_recent_listing_row.market_listing_searchresult')))

unusuals()

After successfully scraping the first item, however, the script goes back to the page and throws a stale element error. The error makes sense to me, but is there any way to circumvent it so that I can keep the functions and use the loop?
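For reference, a common way around this kind of staleness is to never reuse element references across a navigation: count the rows once, then re-locate the list by index on every pass. A minimal sketch of that pattern, with hypothetical names (scrape_each_row, the visit callback) and the row selector taken from the code above:

```python
# Hypothetical sketch of the index-based re-find pattern: instead of looping
# over a list of elements that goes stale after navigating away, re-locate
# the rows on every iteration and click the i-th one fresh.
ROW_SELECTOR = ('.market_listing_row.market_recent_listing_row'
                '.market_listing_searchresult')

def scrape_each_row(driver, visit):
    # "css selector" is the string value behind selenium's By.CSS_SELECTOR
    total = len(driver.find_elements("css selector", ROW_SELECTOR))
    results = []
    for i in range(total):
        # Fresh lookup: references found before the last navigation are stale
        rows = driver.find_elements("css selector", ROW_SELECTOR)
        rows[i].click()
        results.append(visit(driver))  # scrape prices/effects on the item page
        driver.back()                  # return to the listing page
    return results
```

Because each iteration begins with a fresh find_elements call, no reference outlives a page navigation, so the loop itself can no longer raise StaleElementReferenceException.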

Selenium is overkill for this. You can just imitate the HTTP GET requests that your browser makes to the same APIs when it renders the page. Just be careful not to make more than 100,000 daily requests to the Steam API. Also, if the requests happen too frequently, the Steam servers will stop responding until a certain timeout has expired, even if you are nowhere near the 100,000-daily-request limit - that's why I added some time.sleep s for good measure after each request that uses the item_id .
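To make the pacing explicit rather than sprinkling sleep calls by hand, you could wrap the delay in a small helper that enforces a minimum interval between requests. A sketch - the Throttle name and the one-second default are my own choices, not anything Steam documents:

```python
import time

class Throttle:
    """Enforce a minimum delay between successive calls, so consecutive
    requests never fire faster than one per `min_interval` seconds."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last = 0.0  # monotonic timestamp of the previous call

    def wait(self):
        # Sleep only for whatever part of the interval has not already elapsed
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()
```

Calling throttle.wait() immediately before each requests.get would then replace the scattered time.sleep(1) lines in the script below.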

First, you make a request to the market listings page - the one that shows all the items. Then, for each item in the list of results, we extract the item's name, make a request to that item's overview page, and extract the item's item_id from the HTML using a regular expression. Then we make another request to https://steamcommunity.com/market/itemordershistogram to get the most recent price information for that item.
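The regular-expression step can be checked in isolation: the item's overview page embeds a JavaScript call of the form Market_LoadOrderSpread( <id> ), and the named group pulls out the number. The HTML snippet and id below are made up for illustration:

```python
import re

# Same pattern as in the script below; a real page's id is assigned by Steam
item_id_pattern = r"Market_LoadOrderSpread\( (?P<item_id>\d+) \)"

sample_html = "<script>Market_LoadOrderSpread( 1234567 );</script>"  # fabricated
match = re.search(item_id_pattern, sample_html)
print(match.group("item_id"))  # 1234567
```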

Feel free to play around with the start and count query string parameters in the params dictionary. Right now it just prints information for the first ten items:

def main():

    import requests
    from bs4 import BeautifulSoup
    import re
    import time

    url = "https://steamcommunity.com/market/search/render/"

    params = {
        "query": "",
        "start": "0",
        "count": "10",
        "search_descriptions": "0",
        "sort_column": "price",
        "sort_dir": "asc",
        "appid": "440",
        "category_440_Collection[]": "any",
        "category_440_Type[]": "tag_misc",
        "category_440_Quality[]": "tag_rarity4"
    }

    response = requests.get(url, params=params)
    response.raise_for_status()
    time.sleep(1)

    item_id_pattern = r"Market_LoadOrderSpread\( (?P<item_id>\d+) \)"

    soup = BeautifulSoup(response.json()["results_html"], "html.parser")

    for result in soup.select("a.market_listing_row_link"):
        url = result["href"]
        product_name = result.select_one("div")["data-hash-name"]
        try:
            response = requests.get(url)
            response.raise_for_status()
            time.sleep(1)

            item_id_match = re.search(item_id_pattern, response.text)
            assert item_id_match is not None
        except (requests.RequestException, AssertionError):
            print(f"Skipping {product_name}")
            continue

        url = "https://steamcommunity.com/market/itemordershistogram"

        params = {
            "country": "DE",
            "language": "english",
            "currency": "1",
            "item_nameid": item_id_match.group("item_id"),
            "two_factor": "0"
        }

        response = requests.get(url, params=params)
        response.raise_for_status()
        time.sleep(1)

        data = response.json()
        highest_buy_order = float(data["highest_buy_order"]) / 100.0

        print(f"The current highest buy order for \"{product_name}\" is ${highest_buy_order}")

    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())

Output:

The current highest buy order for "Unusual Cadaver's Cranium" is $12.16
The current highest buy order for "Unusual Backbreaker's Skullcracker" is $13.85
The current highest buy order for "Unusual Hard Counter" is $13.04
The current highest buy order for "Unusual Spiky Viking" is $14.26
The current highest buy order for "Unusual Carouser's Capotain" is $12.72
The current highest buy order for "Unusual Cyborg Stunt Helmet" is $12.89
The current highest buy order for "Unusual Stately Steel Toe" is $12.67
The current highest buy order for "Unusual Bloke's Bucket Hat" is $12.71
The current highest buy order for "Unusual Pugilist's Protector" is $12.94
The current highest buy order for "Unusual Shooter's Sola Topi" is $13.25
