
Beautifulsoup requests.get() redirects from specified url

I'm using

requests.get('https://www.pastemagazine.com/search?t=tweets+of+the+week&m=Lists')

like so:

import requests
from bs4 import BeautifulSoup
url = 'https://www.pastemagazine.com/search?t=tweets+of+the+week&m=Lists'
thepage = requests.get(url)
urlsoup = BeautifulSoup(thepage.text, "html.parser")
print(urlsoup.find_all("a", attrs={"class": "large-3 medium-3 cell image"})[0])

But it keeps scraping not from the full URL, but just from the homepage (https://www.pastemagazine.com). I can tell because I expect the print statement to print:

<a class="large-3 medium-3 cell image" href="/articles/2018/12/the-funniest-tweets-of-the-week-109.html" aria-label="">
    <picture data-sizes="[&quot;(min-width: 40em)&quot;,&quot;(min-width: 64em)&quot;]" class="lazyload" data-sources="[&quot;https://cdn.pastemagazine.com/www/opt/120/dogcrp-72x72.jpg&quot;,&quot;https://cdn.pastemagazine.com/www/opt/120/dogcrp-151x151.jpg&quot;,&quot;https://cdn.pastemagazine.com/www/opt/120/dogcrp-151x151.jpg&quot;]">
      <img alt="" />
    </picture>
  </a>

But instead it prints:

<a aria-label='Daily Dose: Michael Chapman feat. Bridget St. John, "After All This Time"' class="large-3 medium-3 cell image" href="/articles/2019/01/daily-dose-michael-chapman-feat-bridget-st-john-af.html"> 
    <picture class="lazyload" data-sizes='["(min-width: 40em)","(min-width: 64em)"]' data-sources='["https://cdn.pastemagazine.com/www/opt/300/MichaelChapman2019_ConstanceMensh_Square-72x72.jpg","https://cdn.pastemagazine.com/www/opt/300/MichaelChapman2019_ConstanceMensh_Square-151x151.jpg","https://cdn.pastemagazine.com/www/opt/300/MichaelChapman2019_ConstanceMensh_Square-151x151.jpg"]'>
      <img alt='Daily Dose: Michael Chapman feat. Bridget St. John, "After All This Time"'/>
    </picture>
  </a>

Which corresponds to an element on the homepage, rather than the specific url I want to scrape from with the search terms. Why does it redirect to the homepage? How can I stop it from doing so?

If you are confident about where the redirect is happening, you can set allow_redirects to False to prevent it:

r = requests.get(url, allow_redirects=False)
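To see the difference this flag makes without depending on the live site, here is a self-contained sketch (the local test server, its handler, and its paths are illustrative, not part of the original answer) that serves a 302 redirect and inspects the response both ways:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

class RedirectHandler(BaseHTTPRequestHandler):
    """Tiny test server: /search redirects to /, which returns a page."""
    def do_GET(self):
        if self.path.startswith("/search"):
            self.send_response(302)
            self.send_header("Location", "/")
            self.end_headers()
        else:
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(b"<html>homepage</html>")
    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), RedirectHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

# With redirects followed (the default), the final URL is the homepage
# and the intermediate 302 hop is recorded in .history:
r = requests.get(f"{base}/search?t=tweets")
print(r.url)       # ends in "/" -- not the /search URL we asked for
print(r.history)   # [<Response [302]>]

# With allow_redirects=False we get the raw 302 and its Location header:
r = requests.get(f"{base}/search?t=tweets", allow_redirects=False)
print(r.status_code, r.headers["Location"])

server.shutdown()
```

Checking `response.history` and `response.url` this way is also a quick diagnostic in the asker's situation: a non-empty `history` confirms the server, not BeautifulSoup, performed the redirect.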

To get the required URLs connected to the tweets, you can try the following script. It turns out that sending a User-Agent header along with the cookies a session carries automatically resolves the redirection issue.

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://www.pastemagazine.com/search?t=tweets+of+the+week&m=Lists"

with requests.Session() as s:
    res = s.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(res.text, 'lxml')
    # Deduplicate the links with a set comprehension before printing
    for link in {urljoin(url, a.get("href")) for a in soup.select("ul.articles a[href*='tweets-of-the-week']")}:
        print(link)

Or to make it even easier, upgrade the following libraries:

pip3 install lxml --upgrade
pip3 install beautifulsoup4 --upgrade

And then try:

with requests.Session() as s:
    res = s.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(res.text, 'lxml')
    for item in soup.select("a.noimage[href*='tweets-of-the-week']"):
        print(urljoin(url, item.get("href")))
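The `urljoin` call in both scripts is what turns the relative `href` values (like the one in the expected output above) into absolute URLs. A quick stdlib-only illustration:

```python
from urllib.parse import urljoin

base = "https://www.pastemagazine.com/search?t=tweets+of+the+week&m=Lists"

# An href starting with "/" resolves against the scheme and host of the
# base URL, discarding its path and query string:
print(urljoin(base, "/articles/2018/12/the-funniest-tweets-of-the-week-109.html"))
# -> https://www.pastemagazine.com/articles/2018/12/the-funniest-tweets-of-the-week-109.html
```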
