Python Web Scraping not getting all of the HTML

I am pretty new to web scraping in Python and am using BeautifulSoup for parsing. Once I have the HTML data, I try to access something under <div id="root">.</div>, but I am not getting all of the HTML that shows when I click "Inspect" on the actual website. How can I access what is under that div, or is this the website's way of blocking me from accessing the information on the page?

If that does not make sense: what I am saying is that there is a "." inside that div instead of the subcategories I expect to see (which I do see when I click Inspect on the webpage).

This is my BeautifulSoup code:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

myurl = 'https://www.coolbet.com/en/sports/incoming-bets'

# open connection and grab page content
uClient = uReq(myurl)
page_html = uClient.read()
uClient.close()

#html parsing
page_soup = soup(page_html, "html.parser")

# grab each product container
containers = page_soup.div.findAll("div", {"class":"sc-iuJeZd iJcGXh"})

print(containers)

It outputs [] because page_soup.div only outputs <div id="root">.</div>

It appears to be dynamic content, so the response you get with urlopen doesn't contain what you see with Inspect in your browser. I would recommend using the Selenium WebDriver to get that content.
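For example, here is a minimal sketch along those lines (assuming Chrome and a matching chromedriver are available; note that class names like sc-iuJeZd iJcGXh are auto-generated and may change between site builds):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://www.coolbet.com/en/sports/incoming-bets')

# wait up to 10 seconds for the JavaScript to render something inside <div id="root">
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '#root div'))
)

page_soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

# same lookup as before, now against the fully rendered HTML
containers = page_soup.find_all('div', {'class': 'sc-iuJeZd iJcGXh'})
print(containers)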

After navigating to https://www.coolbet.com/en/sports/incoming-bets, it seems there is no <div> with the class name you specified in the question. If I am right, you must be authenticated in order to get the desired results (I'm not 100% sure). To log in via Python (getting your session cookies first):

import requests

url = "https://www.coolbet.com/en/login"
payload = {'username': 'abcdef', 'password': '123456'}
with requests.Session() as s:
    # fetch the login page (this also collects the session cookies)
    r1 = s.get(url)
    # post the credentials to the login form
    r2 = s.post(url, data=payload, cookies=r1.cookies)

The variable r2 contains the response from the code snippet above; you can now scrape that page. Note that not every website allows you to scrape it; check the site's robots.txt file. Some sites can only be scraped by sending a valid User-Agent header. Also, please make sure that scraping is allowed by the website you are scraping from.
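As a minimal sketch of that last point (the User-Agent string below is just an example of a browser-like value, not a required one):

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
r = requests.get('https://www.coolbet.com/en/sports/incoming-bets', headers=headers)

page_soup = BeautifulSoup(r.text, 'html.parser')
# still only <div id="root">.</div> here if the content is rendered by JavaScript
print(page_soup.div)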
