
Can't find existing element with beautifulsoup and requests

When I scrape https://www.britannica.com/search?query=world+war+2 I can't find all the elements. I am specifically looking for everything inside the div element with the class search-feature-container (the content inside the info box at the top), but when I scrape, find() just returns None. This is my code:

import requests
from bs4 import BeautifulSoup

def scrape_britannica(product_name):
    ### SETUP ###
    URL_raw = 'https://www.britannica.com/search?query=' + product_name
    URL = URL_raw.strip().replace(" ", "+")
    ## gets the html from the url
    try:
        page = requests.get(URL)
    except requests.RequestException:
        print("Could not find URL..")
        return

    ## parse the html
    soup = BeautifulSoup(page.content, 'html.parser')

    parent = soup.find("div", {"class": "search-feature-container"})
    print(parent)

scrape_britannica('carl barks')

I guess it has something to do with the element not being there yet when the page first opens, but I still don't know how to fix it. Or maybe it's because the website is using cookies. I'm looking for all the ideas I can get! Thx :D

I would find all script tags and check whether the keyword featuredSearchTopic is in any of them. Then I would convert that text into JSON (as a dictionary) and access the 'description' entry.

import requests
from bs4 import BeautifulSoup
import json

def scrape_britannica(product_name):
    ### SETUP ###
    URL_raw = 'https://www.britannica.com/search?query=' + product_name
    URL = URL_raw.strip().replace(" ", "+")
    ## gets the html from the url
    try:
        page = requests.get(URL)
    except requests.RequestException:
        print("Could not find URL..")
        return

    soup = BeautifulSoup(page.content, 'html.parser')

    ## the info box data sits in a <script> tag as a JSON object
    for parent in soup.find_all("script"):
        script_text = parent.string or ""
        if 'featuredSearchTopic' in script_text:
            ## keep only the {...} part, so a ';' or '=' inside the text can't break the parse
            raw = script_text[script_text.find("{"):script_text.rfind("}") + 1]
            txt = json.loads(raw)
            print(txt.get('topicInfo').get('description'))


scrape_britannica('carl barks')

Result:

comic strip: Institutionalization: …Disney artists of them all, Carl Barks, sole creator of more than 500 of the best Donald Duck and other stories, was rescued from the oblivion to which the Disney policy of anonymity would consign him to become a cult figure. His Collected Works ran to 30 luxurious folio volumes.…...

You are dealing with a website that runs JavaScript to render its data once the page loads. You can use the following approach, which reads the script source in the page that contains the part you are looking for. Now you have both the tree and a dict, so you can do whatever you like with it.

import requests
from bs4 import BeautifulSoup
import json


r = requests.get("https://www.britannica.com/search?query=world+war+2")
soup = BeautifulSoup(r.text, 'html.parser')

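# Index 15 is where the search data sat when this was written;
# it will shift if the page adds or removes script tags.
script = soup.find_all(
    "script", {'type': 'text/javascript'})[15].get_text(strip=True)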

start = script.find("{")
end = script.rfind("}") + 1
data = script[start:end]

n = json.loads(data)

print(json.dumps(n, indent=4))

# print(n.keys())

# print(n["topicInfo"]["description"])

Output:

{
    "toc": [
        {
            "id": 1,
            "title": "Introduction",
            "url": "/event/World-War-II"
        },
        {
            "id": 53531,
            "title": "Axis initiative and Allied reaction",
            "url": "/event/World-War-II#ref53531"
        },
        {
            "id": 53563,
            "title": "The Allies\u2019 first decisive successes",
            "url": "/event/World-War-II/The-Allies-first-decisive-successes"
        },
        {
            "id": 53576,
            "title": "The Allied landings in Europe and the defeat of the Axis powers",
            "url": "/event/World-War-II/The-Allied-landings-in-Europe-and-the-defeat-of-the-Axis-powers"
        }
    ],
    "topicInfo": {
        "topicId": 648813,
        "imageId": 74903,
        "imageUrl": "https://cdn.britannica.com/s:300x1000/26/188426-050-2AF26954/Germany-Poland-September-1-1939.jpg",
        "imageAltText": "World War II",
        "title": "World War II",
        "identifier": "1939\u20131945",
        "description": "World War II, conflict that involved virtually every part of the world during the years 1939\u201345. The principal belligerents were the Axis powers\u2014Germany, Italy, and Japan\u2014and the Allies\u2014France, Great Britain, the United States, the Soviet Union, and, to a lesser extent, China. The war was in many...",
        "url": "/event/World-War-II"
    }
}

Output of print(n.keys())

dict_keys(['toc', 'topicInfo'])

Output of print(n["topicInfo"]["description"])

World War II, conflict that involved virtually every part of the world during the years 1939–45. The principal belligerents were the Axis powers—Germany, Italy, and Japan—and the Allies—France, Great Britain, the United States, the Soviet Union, and, to a lesser extent, China. The war was in many...
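
A side note on the slicing step: cutting between the first "{" and the last "}" works here, but it would break if the script held more JavaScript with braces after the object. As a rough sketch (not something the site documents), json.JSONDecoder.raw_decode can parse one object starting at a given position and ignore whatever follows. The helper name first_json_object and the sample string are made up for illustration; in the code above you would pass it the script text instead.

```python
import json


def first_json_object(script_text):
    """Return the first JSON object embedded in script_text, or None."""
    decoder = json.JSONDecoder()
    pos = script_text.find("{")
    while pos != -1:
        try:
            # raw_decode parses one value starting at pos and
            # ignores any trailing text such as ';' or more JavaScript
            obj, _end = decoder.raw_decode(script_text, pos)
            return obj
        except json.JSONDecodeError:
            # this "{" did not start valid JSON, try the next one
            pos = script_text.find("{", pos + 1)
    return None


# Hypothetical sample, shaped like an inline script assignment
sample = 'var featuredSearchTopic = {"topicInfo": {"description": "a = b; c"}};'
data = first_json_object(sample)
print(data["topicInfo"]["description"])
```

Because the parser stops at the end of the object, a ';' or '=' inside the description does no harm, and None comes back when no object is found, so check for that before indexing.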
