
How to scrape the JavaScript-based site https://marketchameleon.com/Calendar/Earnings using Selenium and Python?

I am trying to get earnings dates from https://marketchameleon.com/Calendar/Earnings. The site uses a JavaScript loader to load the earnings table, but when I open the page with Selenium the table does not appear. I tried both the Chrome and Firefox drivers.

A sample of the code:

import os
from selenium import webdriver

# Path to the local geckodriver binary
firefox_driver_path = os.path.abspath('../firefoxdriver_win32/geckodriver.exe')
options = webdriver.FirefoxOptions()
options.add_argument("--enable-javascript")
driver = webdriver.Firefox(executable_path=firefox_driver_path, options=options)
driver.get("https://marketchameleon.com/Calendar/Earnings")

How can I get the data?

I took your code, added a few tweaks, and ran a test to extract the earnings dates from https://marketchameleon.com/Calendar/Earnings as follows:

  • Code Block:

     from selenium import webdriver
     from selenium.webdriver.common.by import By
     from selenium.webdriver.support.ui import WebDriverWait
     from selenium.webdriver.support import expected_conditions as EC

     options = webdriver.ChromeOptions()
     options.add_argument("start-maximized")
     options.add_experimental_option("excludeSwitches", ["enable-automation"])
     options.add_experimental_option('useAutomationExtension', False)
     driver = webdriver.Chrome(options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
     driver.get('https://marketchameleon.com/Calendar/Earnings')
     print(WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table.dateselect_menu_h_table tr > th > span"))).text)
     print(WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.XPATH, "//table[@class='dateselect_menu_h_table']//tr/th/span"))).get_attribute("innerHTML"))

Observation

Similar to your observation, I hit the same roadblock: when using Selenium, the earnings table doesn't load.

[Screenshot: Market Chameleon]


Deep Dive

While inspecting the DOM tree of the webpage, I found that some of the <script> and other tags refer to the keyword akam. As an example:

  • .function(){if(BOOMR=a,BOOMR||{}.BOOMR.plugins=BOOMR,plugins||{}..BOOMR?plugins:AK){var e=""=="true",1,0.t="",n="gertvyrrfrzvsxxfd3ta-f-81b1f5d51-clientnsv4-s.akamaihd.net"
  • <script type="text/javascript" src="https://marketchameleon.com/akam/11/4e7414cb" defer=""></script>
  • <noscript><img src="https://marketchameleon.com/akam/11/pixel_4e7414cb?a=dD03OTIxZTlmM2QwMWVhMDkxODhjNzQwN2E3NmFkNzRiMDQ5ODBkOGU0JmpzPW9mZg==" style="visibility: hidden; position: absolute; left: -999px; top: -999px;" /></noscript>
  • <link id="dnsprefetchlink" rel="dns-prefetch" href="//gertvyrrfrzvsxxfd3ta-f-81b1f5d51-clientnsv4-s.akamaihd.net">

This is a clear indication that the website is protected by Bot Manager, an advanced bot detection service provided by Akamai, and that the response is being blocked.
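The manual DOM inspection described above can be approximated in code. The following is only a minimal sketch: the function name `find_akamai_markers` and the two patterns are hypothetical, chosen to match the tags quoted above, and a hit only hints at Akamai involvement rather than proving that Bot Manager is active.

```python
import re
from typing import List

# Hypothetical marker patterns, based only on the tags quoted above.
AKAMAI_MARKERS = (
    re.compile(r"/akam/\d+/"),      # e.g. https://marketchameleon.com/akam/11/...
    re.compile(r"akamaihd\.net"),   # e.g. the dns-prefetch host
)


def find_akamai_markers(html: str) -> List[str]:
    """Return the distinct Akamai-looking substrings found in html."""
    found = []
    for pattern in AKAMAI_MARKERS:
        for match in pattern.findall(html):
            if match not in found:
                found.append(match)
    return found


sample = ('<script type="text/javascript" '
          'src="https://marketchameleon.com/akam/11/4e7414cb" defer=""></script>')
print(find_akamai_markers(sample))
```

With a live session, the same function could be called as `find_akamai_markers(driver.page_source)` after `driver.get(...)`.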


Bot Manager

As per the article Bot Manager - Foundations:

[Image: akamai_detection]


Conclusion

So it can be concluded that the request for the data is detected as coming from a Selenium-driven WebDriver instance, and the response is blocked.
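One commonly known signal that a page can use to spot a WebDriver-controlled browser is the `navigator.webdriver` property. Whether Akamai relies on it here is an assumption, not something the inspection above confirms. The helper below is a hypothetical sketch that asks the page what it sees; `FakeDriver` is a stand-in so the example runs without a browser.

```python
def reports_webdriver(driver) -> bool:
    """Return True if the page's JavaScript sees navigator.webdriver as true."""
    return bool(driver.execute_script("return navigator.webdriver"))


class FakeDriver:
    """Hypothetical stand-in for a Selenium driver, used only for illustration."""

    def __init__(self, flag):
        self.flag = flag

    def execute_script(self, script):
        # A real driver would run the script in the browser and return its result.
        return self.flag


print(reports_webdriver(FakeDriver(True)))
print(reports_webdriver(FakeDriver(None)))
```

With a real session, `reports_webdriver(driver)` could be called right after `driver.get(...)`. A `True` result would only show that this one signal is exposed; a bot detection service may combine many others.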


