
Unable to parse rating information from a webpage using requests

I tried to scrape a piece of information from a webpage but failed. The text I want is present in the page source, yet I still can't fetch it. The site address is the link used in the script below. I'm after the portion shown in the image as Not Rated.


Relevant html:

<div class="subtext">
                    Not Rated
    <span class="ghost">|</span>                    <time datetime="PT188M">
                        3h 8min
                    </time>
    <span class="ghost">|</span>
<a href="/search/title?genres=drama&amp;explore=title_type,genres&amp;ref_=tt_ov_inf">Drama</a>, 
<a href="/search/title?genres=musical&amp;explore=title_type,genres&amp;ref_=tt_ov_inf">Musical</a>, 
<a href="/search/title?genres=romance&amp;explore=title_type,genres&amp;ref_=tt_ov_inf">Romance</a>
    <span class="ghost">|</span>
<a href="/title/tt0150992/releaseinfo?ref_=tt_ov_inf" title="See more release dates">18 June 1999 (India)
</a>            </div>

I've tried with:

import requests
from bs4 import BeautifulSoup

link = "https://www.imdb.com/title/tt0150992/?ref_=ttfc_fc_tt"

with requests.Session() as s:
    s.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    rating = soup.select_one(".titleBar .subtext").next_element
    print(rating)

I get None using the script above.

Expected output:

Not Rated

How can I get the rating from that webpage?

There is a better way to get the information from the page. First, dump the HTML content returned by the request:

import requests
from bs4 import BeautifulSoup

link = "https://www.imdb.com/title/tt0150992/?ref_=ttfc_fc_tt"

with requests.Session() as s:
    s.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    with open("response.html", "w", encoding=r.encoding) as file:
        file.write(r.text)

In the dumped file you will find an element <script type="application/ld+json"> which contains all the information about the movie.
You then simply get the element's text, parse it as JSON, and read the value you want from the resulting object.
Here is a working example:

import json
import requests
from bs4 import BeautifulSoup

link = "https://www.imdb.com/title/tt0150992/?ref_=ttfc_fc_tt"

with requests.Session() as s:
    s.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    movie_data = soup.find("script", attrs={"type": "application/ld+json"}).string  # find the <script type="application/ld+json"> element and get its content
    movie_data = json.loads(movie_data)  # parse the JSON text into a Python dict
    content_rating = movie_data["contentRating"]  # get the rating
    print(content_rating)
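
The steps described above (find the script element, parse its text as JSON, read the key) can be made a little more defensive. The following is a minimal sketch, not the answer's own code: it uses only the standard library so that it runs offline against an inline HTML sample, and the names JsonLdCollector, extract_content_rating and SAMPLE_HTML are hypothetical. It assumes the page may contain zero or several JSON-LD blocks and that contentRating may be absent.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._inside = False  # True while inside a JSON-LD script element
        self._buffer = []     # text chunks of the current script element
        self.blocks = []      # finished script bodies, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self._buffer = []

    def handle_data(self, data):
        if self._inside:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._inside:
            self.blocks.append("".join(self._buffer))
            self._inside = False


def extract_content_rating(html):
    """Return the first contentRating found in the page's JSON-LD, or None."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip blocks that are not valid JSON
        if isinstance(data, dict) and "contentRating" in data:
            return data["contentRating"]
    return None


# Hypothetical, heavily trimmed stand-in for the real response body.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">{"@type": "Movie", "contentRating": "Not Rated"}</script>
</head><body></body></html>
"""

print(extract_content_rating(SAMPLE_HTML))      # Not Rated
print(extract_content_rating("<html></html>"))  # None
```

Returning None instead of raising lets the caller decide what a missing rating means; with the real page you would pass r.text in place of SAMPLE_HTML.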

IMDb is one of those sites that make web scraping incredibly easy, and I love it. What makes it easy is a script near the top of the HTML that contains the whole movie object as JSON.

So to get all the relevant information in an organized form, you only need to take the content of that single script tag and parse it as JSON. After that you can look up specific fields just as you would in a dictionary.

import requests
import json
from bs4 import BeautifulSoup

# This part is basically the same as yours
link = "https://www.imdb.com/title/tt0150992/?ref_=ttfc_fc_tt"
r = requests.get(link)
soup = BeautifulSoup(r.content,"lxml")

# Why not get the whole JSON element of the movie?
script = soup.find('script', {"type" : "application/ld+json"})
element = json.loads(script.text)

print(element['contentRating'])
# Outputs "Not Rated"

# You can also inspect the rest of the JSON; it has all the relevant information inside
# Just -> print(json.dumps(element, indent=2))

Note: Headers and session are not necessary in this example.
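
The JSON object holds more than the rating, as the comment above points out. As one illustration, the markup in the question encodes the runtime as PT188M, an ISO 8601 duration, next to the visible 3h 8min. Below is a minimal sketch of converting such a value. It assumes only hour and minute components appear; the helper names duration_to_minutes and format_runtime are hypothetical, and whether the JSON exposes the runtime in exactly this form is an assumption.

```python
import re

# Matches durations such as "PT188M" or "PT3H8M" (hours and minutes only).
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?$")


def duration_to_minutes(value):
    """Convert an ISO 8601 duration with H/M components to total minutes."""
    match = _DURATION.match(value)
    if match is None or not any(match.groups()):
        raise ValueError(f"unsupported duration: {value!r}")
    hours, minutes = (int(group) if group else 0 for group in match.groups())
    return hours * 60 + minutes


def format_runtime(total_minutes):
    """Format minutes the way the page displays them, e.g. '3h 8min'."""
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes}min"


print(format_runtime(duration_to_minutes("PT188M")))  # 3h 8min
```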

If you want to get the expected version of the HTML page, specify the Accept-Language HTTP header:

import requests
from bs4 import BeautifulSoup

link = "https://www.imdb.com/title/tt0150992/?ref_=ttfc_fc_tt"

with requests.Session() as s:
    s.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
    s.headers['Accept-Language'] = 'en-US,en;q=0.5'  # <-- specify also this!
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    rating = soup.select_one(".titleBar .subtext").next_element
    print(rating)

Prints (the leading whitespace comes from the markup; rating.strip() removes it):

            Not Rated
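
The header value en-US,en;q=0.5 is a comma-separated list of language ranges with optional quality weights: en-US carries the default weight of 1.0 and en is a fallback at 0.5. The sketch below only illustrates how such a value breaks down. The function name parse_accept_language is hypothetical, and this is not how IMDb or requests process the header.

```python
def parse_accept_language(header):
    """Split an Accept-Language value into (language, weight) pairs,
    ordered from most to least preferred."""
    entries = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        language, _, params = part.partition(";")
        weight = 1.0  # a range without an explicit q parameter defaults to 1.0
        params = params.strip()
        if params.startswith("q="):
            weight = float(params[2:])
        entries.append((language.strip(), weight))
    # sorted() is stable, so entries with equal weight keep their header order
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


print(parse_accept_language("en-US,en;q=0.5"))
# [('en-US', 1.0), ('en', 0.5)]
```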
