
Why does my simple python web crawler run very slowly?

I am trying to scrape about 34,000 pages. I timed it and found that each request takes more than 5 seconds on average. Since I am scraping data directly from APIs, I only used the requests package. Is there any way I could speed up my crawler? Or, if that is not possible, how can I deploy the crawler to a server?

Here's some of my code:

# Using python requests to scrape sellers on shopee.co.id
# Crawl one seller -> Crawl all sellers in the list
# Sample URL: https://shopee.co.id/shop/38281755/search
# Sample API: https://shopee.co.id/api/v2/shop/get?shopid=38281755
import pandas as pd
import requests
import json
from datetime import datetime
import time

PATH_1 = '/Users/lixiangyi/FirstIntern/temp/seller_list.csv'
shop_list = pd.read_csv(PATH_1)
shop_ids = shop_list['shop'].tolist()
# print(seller_list)

# Downloading all APIs of shopee sellers:
api_links = []  # APIs of shops
item_links = []  # Links to click into
for shop_id in shop_ids:
    api_links.append('https://shopee.co.id/api/v2/shop/get?shopid=' + str(shop_id))
    item_links.append(
        f'https://shopee.co.id/api/v2/search_items/?by=pop&limit=10&match_id={shop_id}&newest=0&order=desc&page_type=shop&version=2'
    )
# print(api_links)


shop_names = []
shopid_list = []
founded_time = []
descriptions = []
i = 1

for api_link in api_links[0:100]:  # time the first 100 API requests, one at a time
    start_time = time.time()
    shop_info = requests.get(api_link)
    shopid_list.append(shop_info.text)
    print(i)
    i += 1
    end_time = time.time()
    print(end_time - start_time)

You should try to retrieve multiple URLs in parallel, using either threading or the aiohttp package (a minimal aiohttp sketch follows the threading example below). Using threading:

Update

Since all your requests go to the same website, it is more efficient to use a requests.Session object for the retrievals. However, regardless of how you retrieve these URLs, issuing too many requests from the same IP address to the same website in a short period of time could be interpreted as a denial-of-service attack.

import requests
from concurrent.futures import ThreadPoolExecutor
from functools import partial
import time

api_links = [] # this will have been filled in
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'}

shopid_list = []

def retrieve_url(session, url):
    shop_info = session.get(url)
    return shop_info.text


NUM_THREADS = 75 # experiment with this value
with ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    with requests.Session() as session:
        session.headers = headers
        # session will be the first argument to retrieve_url:
        worker = partial(retrieve_url, session)
        start_time = time.time()
        for result in executor.map(worker, api_links):
            shopid_list.append(result)
        end_time = time.time()
        print(end_time - start_time)
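
For the aiohttp route mentioned above, here is a minimal sketch, assuming aiohttp is installed. It reuses the api_links and headers names from the snippet above; MAX_CONCURRENCY is an assumed tuning value playing the same role as NUM_THREADS, and capping concurrency also helps with the rate-limiting concern noted in the update:

import asyncio
import aiohttp

MAX_CONCURRENCY = 75  # experiment with this value, as with NUM_THREADS above

async def fetch(session, sem, url):
    # The semaphore caps how many requests are in flight at any moment
    async with sem:
        async with session.get(url) as resp:
            return await resp.text()

async def fetch_all(urls):
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    async with aiohttp.ClientSession(headers=headers) as session:
        return await asyncio.gather(*(fetch(session, sem, url) for url in urls))

shopid_list = asyncio.run(fetch_all(api_links))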

Use the python urllib library:

import urllib.request

request_url = urllib.request.urlopen(some_url)  # some_url is a placeholder for the URL to fetch
print(request_url.read())
