Why does my simple Python web crawler run very slowly?
I am trying to scrape about 34,000 pages. I timed the requests and found that each page takes more than 5 seconds on average. Since I am scraping data directly from APIs, I only used the requests package. Is there any way I could speed up my crawler? Or, if that is not possible, how can I deploy the crawler to a server?

Here's some of my code:
# Using python selenium to scrape sellers on shopee.co.id
# Crawl one seller -> Crawl all sellers in the list
# Sample URL: https://shopee.co.id/shop/38281755/search
# Sample API: https://shopee.co.id/api/v2/shop/get?shopid=38281755
import pandas as pd
import requests
import json
from datetime import datetime
import time

PATH_1 = '/Users/lixiangyi/FirstIntern/temp/seller_list.csv'
shop_list = pd.read_csv(PATH_1)
shop_ids = shop_list['shop'].tolist()
# print(seller_list)

# Downloading all APIs of shopee sellers:
api_links = []   # APIs of shops
item_links = []  # Links to click into
for shop_id in shop_ids:
    api_links.append('https://shopee.co.id/api/v2/shop/get?shopid=' + str(shop_id))
    item_links.append(
        f'https://shopee.co.id/api/v2/search_items/?by=pop&limit=10&match_id={shop_id}&newest=0&order=desc&page_type=shop&version=2'
    )
# print(api_links)

shop_names = []
shopid_list = []
founded_time = []
descriptions = []

i = 1
for api_link in api_links[0:100]:
    start_time = time.time()
    shop_info = requests.get(api_link)
    shopid_list.append(shop_info.text)
    print(i)
    i += 1
    end_time = time.time()
    print(end_time - start_time)
You should be trying to retrieve multiple URLs in parallel using either threading or the aiohttp package. Using threading:
Update
Since all your requests are going against the same website, it will be more efficient to use a requests.Session object for making your retrievals. However, regardless of how you go about retrieving these URLs, issuing too many requests from the same IP address to the same website in a short period of time could be interpreted as a Denial of Service attack.
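One way to soften that risk (a sketch that is not part of the original answer; the retry count, backoff factor, and pool size below are illustrative values, not recommendations from the answer) is to mount an HTTPAdapter with a urllib3 Retry policy on the Session, so rate-limit and server errors are retried with exponential backoff instead of being re-sent immediately:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Retry up to 3 times on rate-limit/server errors, with exponential backoff
retry = Retry(total=3, backoff_factor=1,
              status_forcelist=[429, 500, 502, 503, 504])
# pool_maxsize sized to match the number of worker threads
adapter = HTTPAdapter(max_retries=retry, pool_maxsize=75)
session.mount('https://', adapter)
session.mount('http://', adapter)
```

A Session configured this way can be passed to the threaded workers below unchanged; the retries happen transparently inside each `session.get` call.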
import requests
from concurrent.futures import ThreadPoolExecutor
from functools import partial
import time

api_links = []  # this will have been filled in
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'}
shopid_list = []

def retrieve_url(session, url):
    shop_info = session.get(url)
    return shop_info.text

NUM_THREADS = 75  # experiment with this value
with ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    with requests.Session() as session:
        session.headers = headers
        # session will be the first argument to retrieve_url:
        worker = partial(retrieve_url, session)
        start_time = time.time()
        for result in executor.map(worker, api_links):
            shopid_list.append(result)
        end_time = time.time()
        print(end_time - start_time)
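For the aiohttp route mentioned above, a minimal sketch could look like the following (this is not from the original answer; the semaphore limit of 50 is an arbitrary starting point, and `api_links` and `headers` are assumed to be filled in as in the question):

```python
import asyncio
import aiohttp

api_links = []  # this will have been filled in
headers = {'user-agent': 'Mozilla/5.0'}  # same headers as in the threading example

async def fetch(session, sem, url):
    async with sem:  # cap the number of in-flight requests
        async with session.get(url) as resp:
            return await resp.text()

async def fetch_all(urls, limit=50):
    sem = asyncio.Semaphore(limit)
    async with aiohttp.ClientSession(headers=headers) as session:
        return await asyncio.gather(*(fetch(session, sem, u) for u in urls))

shopid_list = asyncio.run(fetch_all(api_links))
```

Compared with the thread pool, this keeps everything on one thread and scales to many concurrent connections cheaply, but the same caution about hammering one site from one IP still applies.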
Use the Python urllib library:

import urllib.request

some_url = 'https://shopee.co.id/api/v2/shop/get?shopid=38281755'  # any one of the API links
request_url = urllib.request.urlopen(some_url)
print(request_url.read())