
Can't find existing element with tag using beautifulsoup and requests

When I try to scrape all the data from https://www.encyclopedia.com/gsearch?q=world+war+2, I can't find all the elements. I am specifically looking for everything inside the a element with the class gs-title (it's the first link to a new forum that isn't a sponsored link), but when I scrape, it just says that it found None. This is my code:

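import requests
from bs4 import BeautifulSoup


def scrape_encyclopedia(product_name):
    ### SETUP ###
    # build the search URL, replacing spaces in the query with "+"
    URL_raw = 'https://www.encyclopedia.com/gsearch?q=' + product_name
    URL = URL_raw.strip().replace(" ", "+")
    ## gets the html from the url
    try:
        page = requests.get(URL)
    except requests.exceptions.RequestException:
        print("Could not find URL..")
        return  # nothing to parse if the request failed

    ## parse the static HTML returned by the server
    soup = BeautifulSoup(page.content, 'html.parser')

    ## first <a> element with the class gs-title (None if there is none)
    parent = soup.find("a", {"class": "gs-title"})
    print(parent)


scrape_encyclopedia('World War 2')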

I guess it has something to do with the content not being loaded yet when the page first opens, but I still don't know how to fix it. Or maybe it's because the website is using cookies. I'm looking for all the ideas I can get! Thx :D

The reason you can't see the element with the class gs-title is that it's not there when you load (GET) the page. The response is just a skeleton. After the page loads in a browser, it makes API calls, and only once the response to such a call is received is the page restructured and the data displayed.

Your issue --> requests only downloads the skeleton HTML and never runs the page's JavaScript, so your Python code has already returned with the skeleton before any API call could fill in the results.

This is typically the case for SPAs or apps built with frontend frameworks.
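A quick way to see the difference is to count the matching tags in the HTML that the server actually returns. The sketch below uses only the standard library's html.parser rather than BeautifulSoup, and the two HTML strings are hypothetical stand-ins for the skeleton response and the browser-rendered page, not the real markup of the site.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Counts start tags of a given name that carry a given CSS class."""

    def __init__(self, tag, class_name):
        super().__init__()
        self.tag = tag
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        # attrs is a list of (name, value) pairs; the value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_tags_with_class(html, tag, class_name):
    finder = ClassFinder(tag, class_name)
    finder.feed(html)
    finder.close()
    return finder.count


# Hypothetical stand-ins: what requests receives vs. what the browser shows.
skeleton = '<html><body><div id="results"></div><script src="cse.js"></script></body></html>'
rendered = '<html><body><div id="results"><a class="gs-title" href="#">World War II</a></div></body></html>'

print(count_tags_with_class(skeleton, "a", "gs-title"))  # 0
print(count_tags_with_class(rendered, "a", "gs-title"))  # 1
```

Running the same count on page.text from the question's code should tell you whether the links are present in the static response at all.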

Your solution would be to find this API call and see how it changes with different search queries. (In your case, the current query is World War 2.)

I was able to find this API call.

https://cse.google.com/cse/element/v1?rsz=filtered_cse&num=10&hl=en&source=gcsc&gss=.com&cselibv=8b2252448421acb3&cx=partner-pub-8594935838850960:3418779728&q=world%20war%202&safe=off&cse_tok=AJvRUv3VJzJXI-LILOFzN60EWiCv:1584040108582&exp=csqr,cc&callback=google.search.cse.api7320

If you look closely, your query is embedded in the URL (q=world%20war%202). When you open this URL, you receive JSON data with a key named results, which is an array, and within that you will see all your data.


Now, if this is your only use case, I would suggest making a Python requests.get() call to this API, loading the JSON data into a dict, and then traversing it for your data.
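Because the URL above carries a callback parameter, the body may well come back as JSONP, that is, the JSON wrapped in a call to that function, rather than as bare JSON. The sketch below strips such a wrapper before parsing. The sample body is hypothetical, and the title and url field names are assumptions for illustration; only the results key comes from the answer above.

```python
import json


def parse_jsonp(text):
    """Return the JSON payload wrapped in a JSONP call such as cb({...});"""
    start = text.index("(") + 1  # the first "(" opens the callback call
    end = text.rindex(")")       # the last ")" closes it
    return json.loads(text[start:end])


# Hypothetical response body; "title" and "url" are made-up field names.
sample = (
    'google.search.cse.api7320({"results": '
    '[{"title": "World War II", "url": "https://example.com/ww2"}]});'
)

data = parse_jsonp(sample)
for item in data.get("results", []):
    print(item["title"], item["url"])
```

With a live request, the text would come from requests.get(api_url).text. If the response turns out to be plain JSON, response.json() is enough and the wrapper stripping can be dropped.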

If this is not your only use case, check how the API URL is built for different search queries, generate it accordingly, do a GET, and traverse the results key.
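Generating the URL for another query can be sketched as below, assuming that only the q parameter needs to change. That may not hold in practice: the cse_tok value in the URL above looks like it embeds a timestamp, so it may expire and need to be captured again. The with_query helper and the shortened template URL are hypothetical.

```python
from urllib.parse import parse_qsl, quote, urlencode, urlsplit, urlunsplit


def with_query(api_url, search_terms):
    """Return api_url with the value of its q parameter replaced."""
    parts = urlsplit(api_url)
    params = [
        (key, search_terms if key == "q" else value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    # quote_via=quote encodes spaces as %20, like the URL found above.
    new_query = urlencode(params, quote_via=quote)
    return urlunsplit(parts._replace(query=new_query))


# Shortened, hypothetical template; the real URL carries more parameters.
template = "https://cse.google.com/cse/element/v1?num=10&hl=en&q=world%20war%202&safe=off"

print(with_query(template, "cold war"))
# https://cse.google.com/cse/element/v1?num=10&hl=en&q=cold%20war&safe=off
```

Rewriting the parsed query instead of concatenating strings keeps the other parameters untouched and handles the escaping of the search terms.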

I hope it helps.


 