
Why does my simple python web crawler run very slowly?

I am trying to scrape about 34,000 pages. I timed it and found that each request takes more than 5 seconds on average. Since I am scraping data directly from APIs, I only used the requests package. Is there any way I could speed up my crawler? Or, if that is not possible, how can I deploy the crawler to a server?

Here's some of my code:

# Using python requests to scrape sellers on shopee.co.id
# Crawl one seller -> Crawl all sellers in the list
# Sample URL: https://shopee.co.id/shop/38281755/search
# Sample API: https://shopee.co.id/api/v2/shop/get?shopid=38281755
import pandas as pd
import requests
import json
from datetime import datetime
import time

PATH_1 = '/Users/lixiangyi/FirstIntern/temp/seller_list.csv'
shop_list = pd.read_csv(PATH_1)
shop_ids = shop_list['shop'].tolist()
# print(seller_list)

# Downloading all APIs of shopee sellers:
api_links = []  # APIs of shops
item_links = []  # Links to click into
for shop_id in shop_ids:
    api_links.append('https://shopee.co.id/api/v2/shop/get?shopid=' + str(shop_id))
    item_links.append(
        f'https://shopee.co.id/api/v2/search_items/?by=pop&limit=10&match_id={shop_id}&newest=0&order=desc&page_type=shop&version=2'
    )
# print(api_links)


shop_names = []
shopid_list = []
founded_time = []
descriptions = []
i = 1

for api_link in api_links[0:100]:  # time the first 100 API requests, one at a time
    start_time = time.time()
    shop_info = requests.get(api_link)
    shopid_list.append(shop_info.text)
    print(i)
    i += 1
    end_time = time.time()
    print(end_time - start_time)

You should try to retrieve multiple URLs in parallel, using either threading or the aiohttp package (a minimal aiohttp sketch follows the threading example below). Using threading:

Update

Since all your requests go to the same website, it is more efficient to use a requests.Session object for the retrievals. However, regardless of how you retrieve these URLs, issuing too many requests from the same IP address to the same website in a short period of time could be interpreted as a denial-of-service attack.

import requests
from concurrent.futures import ThreadPoolExecutor
from functools import partial
import time

api_links = [] # this will have been filled in
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'}

shopid_list = []

def retrieve_url(session, url):
    shop_info = session.get(url)
    return shop_info.text


NUM_THREADS = 75 # experiment with this value
with ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    with requests.Session() as session:
        session.headers = headers
        # session will be the first argument to retrieve_url:
        worker = partial(retrieve_url, session)
        start_time = time.time()
        for result in executor.map(worker, api_links):
            shopid_list.append(result)
        end_time = time.time()
        print(end_time - start_time)
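
For the aiohttp route mentioned above, here is a minimal sketch, assuming aiohttp is installed. It reuses the api_links and headers names from the snippet above; MAX_CONCURRENCY is an assumed tuning value playing the same role as NUM_THREADS, and capping concurrency also helps with the rate-limiting concern noted in the update:

import asyncio
import aiohttp

MAX_CONCURRENCY = 75  # experiment with this value, as with NUM_THREADS above

async def fetch(session, sem, url):
    # The semaphore caps how many requests are in flight at any moment
    async with sem:
        async with session.get(url) as resp:
            return await resp.text()

async def fetch_all(urls):
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    async with aiohttp.ClientSession(headers=headers) as session:
        return await asyncio.gather(*(fetch(session, sem, url) for url in urls))

shopid_list = asyncio.run(fetch_all(api_links))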

Use the python urllib library:

import urllib.request

request_url = urllib.request.urlopen(some_url)  # some_url is a placeholder for the URL to fetch
print(request_url.read())
