
Scraping a Dynamic Table Using Selenium WebDriverWait Returns a Truncated DataFrame

I am trying to scrape a dynamic table called "Holdings" from https://www.ishares.com/us/products/268752/ishares-global-reit-etf

At first I used Selenium but got an empty DataFrame. The community here then suggested inducing a WebDriverWait so the data fully loads before extracting it. That works, but the data I get is truncated from about 400 rows down to only 10. How can I get all the data I need? Any help would be appreciated. Thank you.

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

# Instantiate options (Selenium 4: set headless mode via add_argument)
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")

# Instantiate a webdriver (Selenium 4.6+ resolves a matching chromedriver automatically)
site = 'https://www.ishares.com/us/products/268752/ishares-global-reit-etf'
wd = webdriver.Chrome(options=options)
wd.get(site)

# Induce WebDriver Wait
WebDriverWait(wd, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#onetrust-accept-btn-handler"))).click()
wd.execute_script("arguments[0].scrollIntoView();", WebDriverWait(wd, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@data-componentname]/h2[normalize-space()='Holdings']"))))
data = WebDriverWait(wd, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@aria-describedby='allHoldingsTable_info']"))).get_attribute("outerHTML")
data2 = pd.read_html(data)
holding = data2[0]

The code you wrote is OK, but you missed one point: by default the table is paginated and shows only 10 records per page, which is why you retrieved only those rows. You have to add one more action step, clicking the 'Show More' button, which renders all the records, so your DataFrame will contain all of them. Here is the refactored code:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

# Instantiate options (Selenium 4: set headless mode via add_argument)
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")

# Instantiate a webdriver (Selenium 4.6+ resolves a matching chromedriver automatically)
site = 'https://www.ishares.com/us/products/268752/ishares-global-reit-etf'
wd = webdriver.Chrome(options=options)
wd.maximize_window()
wd.get(site)

# Induce WebDriver Wait
WebDriverWait(wd, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#onetrust-accept-btn-handler"))).click()
wd.execute_script("arguments[0].scrollIntoView();", WebDriverWait(wd, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@data-componentname]/h2[normalize-space()='Holdings']"))))
# Click 'Show More' so the table renders every row instead of only the first page of 10
WebDriverWait(wd, 20).until(EC.element_to_be_clickable((By.XPATH, "(//*[@class='datatables-utilities ui-helper-clearfix']//*[text()='Show More'])[2]"))).click()
data = WebDriverWait(wd, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@aria-describedby='allHoldingsTable_info']"))).get_attribute("outerHTML")
data2 = pd.read_html(data)
holding = data2[0]
print(holding)

Output:

    Ticker                           Name  ...    SEDOL Accrual Date
0      PLD              PROLOGIS REIT INC  ...  B44WZD7            -
1     EQIX               EQUINIX REIT INC  ...  BVLZX12            -
2      PSA            PUBLIC STORAGE REIT  ...  2852533            -
3      SPG  SIMON PROPERTY GROUP REIT INC  ...  2812452            -
4      DLR  DIGITAL REALTY TRUST REIT INC  ...  B03GQS4            -
..     ...                            ...  ...      ...          ...
379    MYR                        MYR/USD  ...        -            -
380    MYR                        MYR/USD  ...        -            -
381    MYR                        MYR/USD  ...        -            -
382    MYR                        MYR/USD  ...        -            -
383    MYR                        MYR/USD  ...        -            -

[384 rows x 12 columns]

Process finished with exit code 0
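As a side note on the pandas half of the pipeline: `pd.read_html` parses every `<table>` in the markup it is given and returns a list of DataFrames, which is why the code indexes `[0]`. Once the full table is captured, it can also be written to CSV so the page does not have to be re-scraped on every run. A minimal self-contained sketch, using a made-up two-row table standing in for the real holdings HTML (the markup and file name below are assumptions for illustration, not the site's actual output):

```python
from io import StringIO
import pandas as pd

# Made-up miniature of the table's outerHTML (assumption, not the real markup)
html = """
<table>
  <thead><tr><th>Ticker</th><th>Name</th></tr></thead>
  <tbody>
    <tr><td>PLD</td><td>PROLOGIS REIT INC</td></tr>
    <tr><td>EQIX</td><td>EQUINIX REIT INC</td></tr>
  </tbody>
</table>
"""

# read_html returns one DataFrame per <table> found in the markup
tables = pd.read_html(StringIO(html))
holding = tables[0]
print(holding.shape)  # (2, 2)

# Persist to CSV and read back to confirm a lossless round trip
holding.to_csv("holdings.csv", index=False)
restored = pd.read_csv("holdings.csv")
print(restored.equals(holding))  # True
```

Wrapping the HTML string in `StringIO` avoids the pandas deprecation warning about passing literal HTML, and the same pattern applies unchanged to the `outerHTML` string retrieved by Selenium above.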
