
Web Scraping using Python sometimes fetches results, sometimes results in HTTP 429

I am trying to scrape Reddit pages for the videos. I am using Python and Beautiful Soup to do the job. The following code sometimes returns the result and sometimes doesn't when I rerun it. I'm not sure where I'm going wrong. Can someone help? I'm a newbie to Python, so please bear with me.

import requests
from bs4 import BeautifulSoup


page = requests.get('https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/')

soup = BeautifulSoup(page.text, 'html.parser')

source_tags = soup.find_all('source')

print(source_tags)

If you do print(page) after your page = requests.get('https:/.........'), you'll see you get a successful <Response [200]>

But if you run it quickly again, you'll get <Response [429]>

"The HTTP 429 Too Many Requests response status code indicates the user has sent too many requests in a given amount of time ("rate limiting")." “HTTP 429 Too Many Requests 响应状态代码表明用户在给定时间内发送了太多请求(“速率限制”)。” Source here来源在这里

Additionally, if you look at the HTML source, you'll see:

<h1>whoa there, pardner!</h1>
<p>we're sorry, but you appear to be a bot and we've seen too many requests
from you lately. we enforce a hard speed limit on requests that appear to come
from bots to prevent abuse.</p>
<p>if you are not a bot but are spoofing one via your browser's user agent
string: please change your user agent string to avoid seeing this message
again.</p>
<p>please wait 6 second(s) and try again.</p>
<p>as a reminder to developers, we recommend that clients make no
    more than <a href="http://github.com/reddit/reddit/wiki/API">one
    request every two seconds</a> to avoid seeing this message.</p>
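That message also tells you the polite fix: make no more than one request every two seconds. If you ever loop over several posts, a minimal sketch of that throttling (the urls list here is hypothetical) would be:

import time

import requests

urls = [
    'https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/',
    # ... more post URLs
]

for url in urls:
    page = requests.get(url)
    print(url, page.status_code)
    time.sleep(2)  # stay at one request every two seconds, per reddit's note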

To add headers and avoid the 429, add:

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"}

page = requests.get('https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/', headers=headers)

Full code:

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"}

page = requests.get('https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/', headers=headers)
print (page)

soup = BeautifulSoup(page.text, 'html.parser')

source_tags = soup.find_all('source')

print(source_tags)

Output:

<Response [200]>
[<source src="https://v.redd.it/et9so1j0z6a21/HLSPlaylist.m3u8" type="application/vnd.apple.mpegURL"/>]

and I have had no issues rerunning it multiple times after waiting a second or two.
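Since the goal is the video itself, note that the URL lives in each tag's src attribute. Continuing from the source_tags list above, something like this (a small follow-up sketch, not part of the original answer) pulls the URLs out:

video_urls = [tag.get('src') for tag in source_tags]
print(video_urls)
# e.g. ['https://v.redd.it/et9so1j0z6a21/HLSPlaylist.m3u8']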

I have tried the code below and it works for me on every request; I added a timeout of 30 seconds.

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/', timeout=30)
if page.status_code == 200:
    soup = BeautifulSoup(page.text, 'lxml')
    source_tags = soup.find_all('source')
    print(source_tags)
else:
    print(page.status_code, page)
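For completeness, the two answers combine naturally: keep the browser User-Agent from the first answer and the timeout plus status check from this one. A sketch of that merge (my own combination using requests.Session; neither answer shows exactly this):

import requests
from bs4 import BeautifulSoup

session = requests.Session()
# reuse one session and send a browser User-Agent on every request
session.headers.update({"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"})

page = session.get('https://www.reddit.com/r/FortNiteBR/comments/afjbbp/just_trying_to_revive_my_buddy_and_then_he_got/', timeout=30)
if page.status_code == 200:
    soup = BeautifulSoup(page.text, 'html.parser')
    print(soup.find_all('source'))
else:
    print(page.status_code, page)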
