Why does my simple Python web crawler run very slowly?
I am trying to crawl about 34,000 pages. I timed it and found that each page request takes more than 5 seconds on average. Since I am scraping the data directly from the API, I only used the requests package. Is there any way to speed up my crawler? Or, if that is not possible, how can I deploy the crawler to a server?
Here is some of my code:
# Using python selenium to scrape sellers on shopee.co.id
# Crawl one seller -> Crawl all sellers in the list
# Sample URL: https://shopee.co.id/shop/38281755/search
# Sample API: https://shopee.co.id/api/v2/shop/get?shopid=38281755
import pandas as pd
import requests
import json
from datetime import datetime
import time

PATH_1 = '/Users/lixiangyi/FirstIntern/temp/seller_list.csv'
shop_list = pd.read_csv(PATH_1)
shop_ids = shop_list['shop'].tolist()
# print(seller_list)

# Downloading all APIs of shopee sellers:
api_links = []   # APIs of shops
item_links = []  # Links to click into
for shop_id in shop_ids:
    api_links.append('https://shopee.co.id/api/v2/shop/get?shopid=' + str(shop_id))
    item_links.append(
        f'https://shopee.co.id/api/v2/search_items/?by=pop&limit=10&match_id={shop_id}&newest=0&order=desc&page_type=shop&version=2'
    )
# print(api_links)

shop_names = []
shopid_list = []
founded_time = []
descriptions = []

i = 1
for api_link in api_links[0:100]:
    start_time = time.time()
    shop_info = requests.get(api_link)
    shopid_list.append(shop_info.text)
    print(i)
    i += 1
    end_time = time.time()
    print(end_time - start_time)  # elapsed time for this request
You should try using threads or the aiohttp package to retrieve multiple URLs in parallel. Using threads:
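A minimal sketch of the threaded pattern. The `fetch` function here is a stand-in so the example runs without network access; in real use it would return `requests.get(url).text`:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for a real request; in practice: return requests.get(url).text
    return f'fetched:{url}'

# Hypothetical shop ids, mirroring the question's api_links list
urls = [f'https://shopee.co.id/api/v2/shop/get?shopid={i}' for i in range(5)]

# executor.map() runs fetch concurrently but yields results in input order
with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(fetch, urls))

print(len(results))  # 5
```

Because the work is I/O-bound (waiting on the network), threads overlap the waiting even under the GIL.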
UPDATE
Since all of your requests are going to the same website, it will be more efficient to use a requests.Session object for the retrievals. However, no matter how you retrieve these URLs, making too many requests to the same website from the same IP address in a short period of time could be interpreted as a denial-of-service attack.
import requests
from concurrent.futures import ThreadPoolExecutor
from functools import partial
import time

api_links = []  # this will have been filled in
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'}
shopid_list = []

def retrieve_url(session, url):
    shop_info = session.get(url)
    return shop_info.text

NUM_THREADS = 75  # experiment with this value

with ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    with requests.Session() as session:
        session.headers = headers
        # session will be the first argument to retrieve_url:
        worker = partial(retrieve_url, session)
        start_time = time.time()
        for result in executor.map(worker, api_links):
            shopid_list.append(result)
        end_time = time.time()
        print(end_time - start_time)
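The aiohttp route follows the same shape with coroutines. Below is a sketch of the concurrency pattern using only the standard library; the simulated `fetch` stands in for aiohttp's `async with session.get(url) as resp: return await resp.text()`, since aiohttp is a third-party install:

```python
import asyncio

async def fetch(url):
    # With aiohttp this body would be:
    #   async with session.get(url) as resp:
    #       return await resp.text()
    await asyncio.sleep(0.01)  # simulate network latency
    return f'body:{url}'

async def main(urls):
    # With aiohttp, wrap this in: async with aiohttp.ClientSession() as session:
    # gather() awaits all coroutines concurrently and preserves input order
    return await asyncio.gather(*(fetch(url) for url in urls))

urls = [f'https://shopee.co.id/api/v2/shop/get?shopid={i}' for i in range(10)]
results = asyncio.run(main(urls))
print(len(results))  # 10
```

With real aiohttp you would also cap concurrency (e.g. with an `asyncio.Semaphore`) to avoid the denial-of-service concern mentioned above.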
Use the Python urllib library:
import urllib.request

request_url = urllib.request.urlopen(some_url)  # some_url is the URL string to fetch
print(request_url.read())