
How do you scrape the whole page using selenium?

My goal is to read a certain span nested deep within a ton of divs. The only issue is that these elements seem to be rendered by JavaScript, so as far as I'm aware I can't get them just by using driver.page_source.

Here is my code:

import requests # for making standard html requests
from bs4 import BeautifulSoup # magical tool for parsing html data
import json # for parsing data
from pandas import DataFrame as df # premier library for data organization
import time
import lxml
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager


url = "https://www.challengermode.com/dota2/tournaments?state=upcoming"
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get(url)
time.sleep(5) # To let the page load in
soup_ID = BeautifulSoup(driver.page_source, 'html.parser')
print(soup_ID.prettify)

Here is an image of the span of information I want to be included in the print

So here is my output:

<bound method Tag.prettify of <html class="arena-html mod_flexbox mod_flexwrap mod_cssscrollbar mod_eventlistener mod_scriptasync mod_localstorage mod_sessionstorage mod_websockets mod_eventsource" id="html" lang="en" style="margin: 0px; padding: 0px;"><head>
<base href="/"/>
<link href="https://fonts.googleapis.com/css?family=Roboto:300,400,400i,500,700&amp;display=swap" rel="stylesheet"/>
<link as="style" href="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/light.43d62e718e19239b66ac.css" rel="preload"/>
<link href="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/light.43d62e718e19239b66ac.css" media="all" onload="this.media='all'" rel="stylesheet"/>
<noscript><link href="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/light.43d62e718e19239b66ac.css" rel="stylesheet"/></noscript>
<link as="style" href="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/arena-paypal.26f2c9c2acd9b96ba93b.css" rel="preload"/>
<link href="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/arena-paypal.26f2c9c2acd9b96ba93b.css" media="all" onload="this.media='all'" rel="stylesheet"/>
<noscript><link href="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/arena-paypal.26f2c9c2acd9b96ba93b.css" rel="stylesheet"/></noscript>
<script async="" src="https://widget.intercom.io/widget/yxk7m4ye" type="text/javascript"></script><script async="" src="https://www.google-analytics.com/gtm/js?id=GTM-MHVMG4G&amp;t=gtag_UA_63855440_1&amp;cid=2113228608.1596037460" type="text/javascript"></script><script async="" src="https://www.google-analytics.com/plugins/ua/linkid.js" type="text/javascript"></script><script async="" src="https://www.googleadservices.com/pagead/conversion_async.js" type="text/javascript"></script><script async="" src="https://connect.facebook.net/signals/config/1363905500304531?v=2.9.22&amp;r=stable"></script><script async="" crossorigin="anonymous" src="https://connect.facebook.net/en_US/sdk.js?hash=4c7217325ae946d41396c9d017814623&amp;ua=modern_es6"></script><script async="" src="https://www.googletagmanager.com/gtag/js?id=AW-969263990&amp;l=dataLayer&amp;cx=c" type="text/javascript"></script><script async="" src="https://www.google-analytics.com/analytics.js" type="text/javascript"></script><script async="" src="https://www.gstatic.com/recaptcha/releases/AFBwIe6h0oOL7MOVu88LHld-/recaptcha__en.js" type="text/javascript"></script><script id="facebook-jssdk" src="//connect.facebook.net/en_US/sdk.js"></script><script async="" src="https://connect.facebook.net/en_US/fbevents.js"></script><script async="true" crossorigin="anonymous" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/manifest.2aa5da30056e9cc4eae7.bundle.js"></script>
<title>Dota 2 Tournaments | Challengermode</title>
<meta charset="utf-8"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="width=750, user-scalable=no" name="viewport"/>
<meta content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no" name="viewport"/>
<link href="/pwa-manifest.json" rel="manifest"/>
<link href="/opensearch" rel="search" title="Challengermode" type="application/opensearchdescription+xml"/>
<meta content="#252730" name="theme-color"/>
<meta content="#252730" name="msapplication-navbutton-color"/>
<meta content="yes" name="apple-mobile-web-app-capable"/>
<meta content="black-translucent" name="apple-mobile-web-app-status-bar-style"/>
<meta content="Challengermode" name="apple-mobile-web-app-title"/>
<link href="https://challengermode-permanent-assets.azureedge.net/app/cm-192-logo.png" rel="apple-touch-icon" sizes="192x192"/>
<link href="https://challengermode-permanent-assets.azureedge.net/app/cm-512-logo.png" rel="apple-touch-icon" sizes="512x512"/>
<link href="https://challengermode-permanent-assets.azureedge.net/app/splashscreens/iphone6_splash.png" rel="apple-touch-startup-image"/>
<link href="https://challengermode-permanent-assets.azureedge.net/app/splashscreens/iphonex_splash.png" media="(device-width: 375px) and (device-height: 812px) and (-webkit-device-pixel-ratio: 3)" rel="apple-touch-startup-image"/>
<link href="https://challengermode-permanent-assets.azureedge.net/app/splashscreens/iphone6_splash.png" media="(device-width: 375px) and (device-height: 667px) and (-webkit-device-pixel-ratio: 2)" rel="apple-touch-startup-image"/>
<link href="https://challengermode-permanent-assets.azureedge.net/app/splashscreens/iphoneplus_splash.png" media="(device-width: 414px) and (device-height: 736px) and (-webkit-device-pixel-ratio: 3)" rel="apple-touch-startup-image"/>
<link href="https://challengermode-permanent-assets.azureedge.net/app/splashscreens/iphone5_splash.png" media="(device-width: 320px) and (device-height: 568px) and (-webkit-device-pixel-ratio: 2)" rel="apple-touch-startup-image"/>
<link href="https://www.challengermode.com/tournaments/feed" rel="alternate" type="application/atom+xml"/>
<link href="https://www.challengermode.com/spaces/feed" rel="alternate" type="application/atom+xml"/>
<link href="https://www.challengermode.com/classifieds/feed" rel="alternate" type="application/atom+xml"/>
<meta content="Leading platform for Dota 2 esports tournaments. Compete in quality tournaments from the best organizers or create your own space &amp; monetize your community." name="description"/>
<meta content="challengermode esports competitions tournaments leagues skills solo team organize host
lol league of legends csgo counter-strike: global offensive pubg playerunknowns battlegrounds dota 2 teamfight tactics tft valorant" name="keywords"/>
<meta content="index,follow" name="robots"/>
<meta content="English" name="language"/>
<link href="https://www.challengermode.com/dota2/tournaments?state=upcoming" rel="canonical"/>
<link href="https://api.challengermode.com" rel="dns-prefetch"/>
<link crossorigin="" href="https://api.challengermode.com" rel="preconnect"/>
<link href="https://syndication.twitter.com" rel="preconnect"/>
<link href="https://widget.intercom.io" rel="preconnect"/>
<link href="https://js.intercomcdn.com" rel="preconnect"/>
<link href="https://www.facebook.com" rel="preconnect"/>
<link crossorigin="" href="https://connect.facebook.net" rel="preconnect"/>
<link href="https://api-iam.intercom.io" rel="preconnect"/>
<link crossorigin="" href="https://fonts.gstatic.com" rel="preconnect"/>
<link href="https://az416426.vo.msecnd.net" rel="preconnect"/>
<link href="https://stats.g.doubleclick.net" rel="preconnect"/>
<link crossorigin="" href="https://fonts.googleapis.com" rel="preconnect"/>
<link href="https://dc.services.visualstudio.com" rel="preconnect"/>
<meta content="https://www.challengermode.com/dota2/tournaments?state=upcoming" property="og:url"/>
<meta content="Dota 2 Tournaments" property="og:title"/>
<meta content="Leading platform for Dota 2 esports tournaments. Compete in quality tournaments from the best organizers or create your own space &amp; monetize your community." property="og:description"/>
<meta content="https://challengermode-permanent-assets.azureedge.net/app/og_image.png" property="og:image"/>
<meta content="image/png" property="og:image:type"/>
<meta content="1200" property="og:image:width"/>
<meta content="630" property="og:image:height"/>
<meta content="website" property="og:type"/>
<meta content="Challengermode" property="og:site_name"/>
<meta content="cm:game_info_slug:f52a42ce-3425-4dca-ab1d-e425ea1e71ea" property="og:cm_resource"/>
<meta content="3625f24494c7ac4f0ad3" name="wot-verification"/>
<meta content="1179483245396310" property="fb:app_id"/>
<style>

    body::after {
        content: "none";
        display: none !important
    }

    @media (max-width:1920px) {
        body::after {
            content: "breakpoint--full-hd"
        }
    }

    @media (max-width:1280px) {
        body::after {
            content: "breakpoint--hd"
        }
    }

    @media (max-width:1024px) {
        body::after {
            content: "breakpoint--tablet"
        }
    }

    @media (max-width:414px) {
        body::after {
            content: "breakpoint--mobile"
        }
    }
</style>
<script src="//az416426.vo.msecnd.net/scripts/a/ai.0.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/0.1d1eb0a321bfe9aa47ee.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/1.97217bf357c5de4a751a.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/2.3240916b8c45c6c77a5b.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/3.966cc108df5a7515bf50.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/7.ed08c498b552166708b9.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/175.f6ae048c521d527a8f53.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/282.a0ab5b4c130061ae89b3.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/323.86bf89e818dd1c06cf21.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/337.c989accb4d8622d946e5.bundle.js"></script><style data-emotion=""></style><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/5.da829e90054bb31c6591.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/4.fc75798185acc24a996a.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/6.ba3b4ef40d494de88ed8.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/8.0a8441153a17e1c20931.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/9.92e08e43b5aeab83b11a.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/11.75d6926838e4e7c55f20.bundle.js"></script><script charset="utf-8" 
src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/17.0c42d6a55e624fc36e4c.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/51.79196085aeb507e3486e.bundle.js"></script><link href="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/10.5df7cf3cfa886d3230a3.css" rel="stylesheet" type="text/css"/><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/10.8c3b8aef15bdf341e192.bundle.js"></script><link href="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/13.8ddd5b6f8bfee769c14a.css" rel="stylesheet" type="text/css"/><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/13.d20ed356ddb838ab76ce.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/16.9da04cea0e07cef002f4.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/22.13bf9d744401ea38a0bd.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/30.803fc5a3967c13785bb5.bundle.js"></script><link href="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/71.ab772642f9c8624e736d.css" rel="stylesheet" type="text/css"/><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/71.e7da16d37e16b62bf79b.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/158.818c18197b42c18410d9.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/284.8b5c95597f8814f01390.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/12.404ebfb3d2a9e09d5abc.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/75.9fddbb16d492adbd2ab5.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/161.1c91a2c2545bb21d7e20.bundle.js"></script><script 
src="https://googleads.g.doubleclick.net/pagead/viewthroughconversion/969263990/?random=1596037459961&amp;cv=9&amp;fst=1596037459961&amp;num=1&amp;bg=ffffff&amp;guid=ON&amp;resp=GooglemKTybQhCsO&amp;u_h=1080&amp;u_w=1920&amp;u_ah=1080&amp;u_aw=1920&amp;u_cd=24&amp;u_his=2&amp;u_tz=120&amp;u_java=false&amp;u_nplug=3&amp;u_nmime=4&amp;gtm=2oa7m1&amp;sendb=1&amp;ig=1&amp;data=event%3Dpage_view&amp;frm=0&amp;url=https%3A%2F%2Fwww.challengermode.com%2Fdota2%2Ftournaments%3Fstate%3Dupcoming&amp;tiba=Dota%202%20Tournaments%20%7C%20Challengermode&amp;hn=www.googleadservices.com&amp;async=1&amp;rfmt=3&amp;fmt=4"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/14.eb76c66c32e99864e5ad.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedge.net/dist2/15.1379135acdc99c059dcd.bundle.js"></script><script charset="utf-8" src="https://cmp-edge-webapp-cdn2.azureedg

The desired output would be to have all the source code marked in blue and red show up in the output.

Oh and if you have any questions or need more info, I'd gladly provide.

I have found a workaround. It seems that I'm not able to PRINT everything, but it's still stored. So if I use driver.find_element_by_class_name("link-white") it works perfectly for my goals.
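A note on the printing itself: the output in the question begins with `<bound method Tag.prettify of`, which is what Python prints when a method is referenced without being called. The call was therefore probably meant to be `soup_ID.prettify()`, with parentheses. The sketch below shows the difference using a hypothetical stand-in class (`Page` is not part of BeautifulSoup), so it runs without a browser. Whether the JavaScript-rendered spans then appear in the printed source still depends on the page having finished loading.

```python
class Page:
    """Hypothetical stand-in for a parsed document such as a BeautifulSoup object."""

    def __init__(self, markup):
        self.markup = markup

    def prettify(self):
        # Return the markup as a string (a real parser would also re-indent it).
        return self.markup


page = Page("<html><body><span>24 Jul</span></body></html>")

# Without parentheses, print() receives the method object itself,
# so the output is its repr, starting with "<bound method".
without_call = repr(page.prettify)
print(without_call)

# With parentheses, the method runs and print() receives the returned string.
with_call = page.prettify()
print(with_call)
```

With BeautifulSoup the same distinction applies: `print(soup_ID.prettify)` prints the method's repr, while `print(soup_ID.prettify())` prints the formatted document.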

dates = driver.find_elements_by_xpath('//span[@class="f--medium f--small--mobile fw--bold c--white-dark tt--u lh--1em ellipsis dis--blk"]/span/span')
for a in dates:
    print(a.text)

find_elements_by_xpath returns a list of every element in the rendered page that matches the XPath expression. Here the dates are nested in a span > span > span.

This uses an XPath selector, although you could locate the elements by other means, such as CSS selectors or IDs.

  • // searches the whole HTML document, at any depth
  • span[@class="xx"] - selects a span whose class attribute is exactly "xx"
  • /span/span - each /tag step moves down to a direct child with that tag; here it descends through two nested spans

Then I've created a for loop to print the text of all the dates on the page.
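
The XPath steps described above can be tried offline, without a browser. This is a minimal sketch using Python's standard-library xml.etree.ElementTree, which supports only a subset of XPath. The markup and the `date-label` class name are made up for illustration, and the real page's structure may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page's structure may differ.
SAMPLE = """
<div>
  <span class="date-label"><span><span>29 Jul</span></span></span>
  <span class="date-label"><span><span>30 Jul</span></span></span>
  <span class="other"><span><span>ignored</span></span></span>
</div>
"""

root = ET.fromstring(SAMPLE)

# ".//" searches all descendants of root, [@class='...'] keeps spans whose
# class attribute equals the value exactly, and each "/span" step moves
# down to a direct child span.
dates = [el.text for el in root.findall(".//span[@class='date-label']/span/span")]
print(dates)
```

Note that the class predicate compares the whole attribute string, so a span with several classes matches only if the full, space-separated value is given, as in the Selenium example above.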
