Web scraping using Python sometimes fetches results, sometimes results in HTTP 429
I am trying to scrape reddit pages for the videos. I am using Python and Beautiful Soup to do the job. The following code sometimes returns the result and sometimes does not when I rerun it. I'm not sure where I'm going wrong. Can someone help? I'm a newbie to Python, so please bear with me.
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/')
soup = BeautifulSoup(page.text, 'html.parser')
source_tags = soup.find_all('source')
print(source_tags)
If you do print(page) after your page = requests.get('https:/.........') line, you'll see you get a successful <Response [200]>.
But if you run it again quickly, you'll get a <Response [429]>.
"The HTTP 429 Too Many Requests response status code indicates the user has sent too many requests in a given amount of time ("rate limiting")." Source here
Additionally, if you look at the HTML source, you'd see:
<h1>whoa there, pardner!</h1>
<p>we're sorry, but you appear to be a bot and we've seen too many requests
from you lately. we enforce a hard speed limit on requests that appear to come
from bots to prevent abuse.</p>
<p>if you are not a bot but are spoofing one via your browser's user agent
string: please change your user agent string to avoid seeing this message
again.</p>
<p>please wait 6 second(s) and try again.</p>
<p>as a reminder to developers, we recommend that clients make no
more than <a href="http://github.com/reddit/reddit/wiki/API">one
request every two seconds</a> to avoid seeing this message.</p>
To add headers and avoid the 429, add:
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"}
page = requests.get('https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/', headers=headers)
Full code:
import requests
from bs4 import BeautifulSoup
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"}
page = requests.get('https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/', headers=headers)
print (page)
soup = BeautifulSoup(page.text, 'html.parser')
source_tags = soup.find_all('source')
print(source_tags)
Output:
<Response [200]>
[<source src="https://v.redd.it/et9so1j0z6a21/HLSPlaylist.m3u8" type="application/vnd.apple.mpegURL"/>]
and I have had no issues rerunning it multiple times after waiting a second or two.
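Once the source tags come back, the actual video URL is in each tag's src attribute. A small sketch of pulling it out, parsing the exact snippet from the output above so it runs without touching reddit:

```python
from bs4 import BeautifulSoup

# The <source> tag exactly as printed in the output above; parsing a
# fixed snippet keeps the example runnable without a network request.
html = ('<source src="https://v.redd.it/et9so1j0z6a21/HLSPlaylist.m3u8" '
        'type="application/vnd.apple.mpegURL"/>')
soup = BeautifulSoup(html, 'html.parser')
urls = [tag['src'] for tag in soup.find_all('source')]
print(urls)  # ['https://v.redd.it/et9so1j0z6a21/HLSPlaylist.m3u8']
```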
I have tried the code below and it works for me on every request; I added a timeout of 30 seconds.
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/', timeout=30)
if page.status_code == 200:
    soup = BeautifulSoup(page.text, 'lxml')
    source_tags = soup.find_all('source')
    print(source_tags)
else:
    print(page.status_code, page)
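The fixes from both answers (a browser User-Agent, a timeout, a status check) can also be combined into one session. A possible sketch using requests' built-in urllib3 Retry support, so 429s are retried with exponential backoff automatically; the helper name is my own:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session():
    """Build a session that retries 429 responses with backoff."""
    session = requests.Session()
    # Same desktop user agent as in the first answer above.
    session.headers["User-Agent"] = (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"
    )
    # Retry up to 3 times on 429, waiting longer after each attempt.
    retry = Retry(total=3, status_forcelist=[429], backoff_factor=2)
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session

# Usage (makes a network call, so not run here):
# page = make_session().get(url, timeout=30)
```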