BeautifulSoup requests.get() redirects from specified URL
I am using
requests.get('https://www.pastemagazine.com/search?t=tweets+of+the+week&m=Lists')
like this:
import requests
from bs4 import BeautifulSoup
url = 'https://www.pastemagazine.com/search?t=tweets+of+the+week&m=Lists'
thepage = requests.get(url)
urlsoup = BeautifulSoup(thepage.text, "html.parser")
print(urlsoup.find_all("a", attrs={"class": "large-3 medium-3 cell image"})[0])
However, it keeps scraping content not from the full URL but from the home page ('https://www.pastemagazine.com'). I can tell because I expect the print statement to output:
<a class="large-3 medium-3 cell image" href="/articles/2018/12/the-funniest-tweets-of-the-week-109.html" aria-label="">
<picture class="lazyload" data-sizes='["(min-width: 40em)","(min-width: 64em)"]' data-sources='["https://cdn.pastemagazine.com/www/opt/120/dogcrp-72x72.jpg","https://cdn.pastemagazine.com/www/opt/120/dogcrp-151x151.jpg","https://cdn.pastemagazine.com/www/opt/120/dogcrp-151x151.jpg"]'>
<img alt="" />
</picture>
</a>
But instead it prints:
<a aria-label='Daily Dose: Michael Chapman feat. Bridget St. John, "After All This Time"' class="large-3 medium-3 cell image" href="/articles/2019/01/daily-dose-michael-chapman-feat-bridget-st-john-af.html">
<picture class="lazyload" data-sizes='["(min-width: 40em)","(min-width: 64em)"]' data-sources='["https://cdn.pastemagazine.com/www/opt/300/MichaelChapman2019_ConstanceMensh_Square-72x72.jpg","https://cdn.pastemagazine.com/www/opt/300/MichaelChapman2019_ConstanceMensh_Square-151x151.jpg","https://cdn.pastemagazine.com/www/opt/300/MichaelChapman2019_ConstanceMensh_Square-151x151.jpg"]'>
<img alt='Daily Dose: Michael Chapman feat. Bridget St. John, "After All This Time"'/>
</picture>
</a>
which corresponds to an element on the home page, not the specific URL with the search terms that I want to scrape. Why is it redirecting to the home page, and how can I stop it?
If you are sure the redirect is the problem, you can set allow_redirects to False to prevent the redirection:
r = requests.get(url, allow_redirects=False)
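To see concretely what allow_redirects changes, here is a minimal self-contained sketch that reproduces the behaviour against a throwaway local server (the local server and its paths are purely illustrative, not Paste Magazine's actual setup). With the default settings requests silently follows the 302 and records it in response.history; with allow_redirects=False the response stops at the 302 itself:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

# Tiny local server: "/" issues a 302 to "/home", which returns 200.
class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/":
            self.send_response(302)
            self.send_header("Location", "/home")
            self.end_headers()
        else:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"home page")

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}/"

followed = requests.get(base)                        # default: follow the 302
stopped = requests.get(base, allow_redirects=False)  # stop at the 302

print(followed.status_code, [r.status_code for r in followed.history])
print(stopped.status_code)

server.shutdown()
```

Checking response.history (or response.url) on your real request is the quickest way to confirm whether a redirect back to the home page is actually happening.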
To get the required URLs leading to the tweets, you can try the following script. It turns out that using headers along with cookies fixes the redirect issue.
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://www.pastemagazine.com/search?t=tweets+of+the+week&m=Lists"

with requests.Session() as s:
    res = s.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(res.text, 'lxml')
    for item in set([urljoin(url, item.get("href")) for item in soup.select("ul.articles a[href*='tweets-of-the-week']")]):
        print(item)
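The urljoin call in the script above is what turns the relative href values scraped from the page into absolute links. For example, resolving one of the hrefs from the question against the search URL:

```python
from urllib.parse import urljoin

base = "https://www.pastemagazine.com/search?t=tweets+of+the+week&m=Lists"

# A relative href scraped from the page, resolved against the page's URL;
# the path replaces /search and the query string is dropped.
resolved = urljoin(base, "/articles/2018/12/the-funniest-tweets-of-the-week-109.html")
print(resolved)
# https://www.pastemagazine.com/articles/2018/12/the-funniest-tweets-of-the-week-109.html
```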
To make it even simpler, upgrade the following libraries:
pip3 install lxml --upgrade
pip3 install beautifulsoup4 --upgrade
Then try:
with requests.Session() as s:
    res = s.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(res.text, 'lxml')
    for item in soup.select("a.noimage[href*='tweets-of-the-week']"):
        print(urljoin(url, item.get("href")))
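For what it's worth, the [href*='tweets-of-the-week'] part of both selectors is a CSS substring match: it keeps only anchors whose href contains that text. A quick offline sketch with made-up markup (standing in for the real search results page) shows the filtering:

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the search results page
html = """
<ul class="articles">
  <li><a class="noimage" href="/articles/2018/12/the-funniest-tweets-of-the-week-109.html">Tweets of the Week</a></li>
  <li><a class="noimage" href="/articles/2019/01/daily-dose-michael-chapman.html">Daily Dose</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# [href*='...'] matches any <a> whose href contains the substring
links = [a.get("href") for a in soup.select("a[href*='tweets-of-the-week']")]
print(links)
# ['/articles/2018/12/the-funniest-tweets-of-the-week-109.html']
```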