Can't solve HTTP Error 429 with time.sleep()
I need to scrape Google search result links. However, even though I put time.sleep() in my code, I still get HTTP Error 429.
It works for 50-100 rows and then raises error 429, but I need to scrape several hundred barcode links.
How can I solve this problem?
import time
from itertools import chain

import pandas as pd
import requests
from bs4 import BeautifulSoup

barcode_df = pd.read_csv("C:/Users/emina/Coding_Projects/PycharmProjects/drug_interaction(pycharm)/barcodes.csv")
barcode_list2d = barcode_df.values.tolist()
barcode_list = list(chain.from_iterable(barcode_list2d))  # This is the list we'll iterate over
barcode_list = [x for x in barcode_list if type(x) == str]
barcode_list_deneme = barcode_list[0:20]
barcode_list1 = barcode_list[0:1000]

USER_AGENT = "some user agent"
headers = {"user-agent": USER_AGENT}

def append_links_to_csv(barcode):
    if resp.status_code == 200:
        soup = BeautifulSoup(resp.content, "html.parser")
        for g in soup.find_all('div', class_='r'):
            anchors = g.find_all('a')
            if anchors:
                link = anchors[0]['href']  # Parses search link
                l.write(barcode + "," + link + "\n")
                time.sleep(0.06)
    else:
        print(resp.status_code)

count = 0
l = open("links.csv", "a")
for barcode in barcode_list1:
    query = barcode + "+" + "site:ilacabak.com"
    url = f"https://google.com/search?q={query}"
    resp = requests.get(url, headers=headers)
    append_links_to_csv(barcode)
    count += 1
    print(count)
    time.sleep(1.5)
    if count % 100 == 0:
        l.close()
        l = open("links.csv", "a")
l.close()
Check whether the response contains a Retry-After header and wait that many seconds before retrying.
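A minimal sketch of that idea; the `get_with_retry` helper name, the retry count, and the 30-second fallback delay are assumptions, not part of the original answer:

```python
import time

def get_with_retry(url, headers=None, max_retries=3, get=None):
    """Fetch url; on HTTP 429, wait per the Retry-After header, then retry."""
    if get is None:  # default to requests.get; injectable to keep the sketch testable
        import requests
        get = requests.get
    resp = None
    for _ in range(max_retries):
        resp = get(url, headers=headers)
        if resp.status_code != 429:
            return resp
        # Retry-After may be absent; fall back to 30 seconds (assumption)
        time.sleep(float(resp.headers.get("Retry-After", 30)))
    return resp
```

Replacing the bare `requests.get(url, headers=headers)` in the question's loop with such a helper would make the crawl pause for exactly as long as Google asks, instead of a fixed 1.5 seconds.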
One thing you can try is to rotate user-agents and combine that with the Retry-After header, which tells you how long to wait before making a new request when the response status code is 429, as Sameer Naik already suggested.
For example:
import random, requests

user_agent_list = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
]

for _ in range(len(user_agent_list)):
    # Pick a random user agent
    user_agent = random.choice(user_agent_list)
    headers = {'User-Agent': user_agent}
    requests.get('URL', headers=headers)
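Putting the two ideas together, a hedged sketch of one request that picks a fresh random User-Agent on each attempt and backs off per Retry-After on 429; the `fetch_rotating` name, retry count, and 10-second fallback are assumptions:

```python
import random
import time

def fetch_rotating(url, user_agents, max_retries=3, get=None):
    """Retry with a fresh random User-Agent, backing off per Retry-After on 429."""
    if get is None:  # default to requests.get; injectable to keep the sketch testable
        import requests
        get = requests.get
    resp = None
    for _ in range(max_retries):
        headers = {'User-Agent': random.choice(user_agents)}
        resp = get(url, headers=headers)
        if resp.status_code != 429:
            return resp
        # 10-second fallback when Retry-After is missing (assumption)
        time.sleep(float(resp.headers.get('Retry-After', 10)))
    return resp
```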
Alternatively, you can use SerpApi's Google Search Engine Results API and stop worrying about it. It's a paid API with a free plan. Check out the playground.
The difference in your case is that you only need to think about which data to extract from the structured JSON, rather than figuring out how to get around blocks from Google (or other search engines), or how to extract certain elements from the HTML (especially if the data you need sits in JavaScript-rendered markup and you don't want to use browser automation such as selenium or requests-html).
Sample code to integrate:
from serpapi import GoogleSearch

params = {
    "api_key": "YOUR_API_KEY",
    "engine": "google",
    "q": "Coffee",
    "location": "Austin, Texas, United States",
    "google_domain": "google.com",
    "gl": "us",
    "hl": "en"
    # other query parameters
}

search = GoogleSearch(params)
results = search.get_dict()

# prints all results from the first page of organic results
for result in results['organic_results']:
    print(result['title'], result['link'], sep='\n')
Disclaimer: I work for SerpApi.