
Using selenium to scrape paginated table data (Python)

I have this table: https://www.londonstockexchange.com/indices/ftse-aim-all-share/constituents/table?page=1 . It's paginated, and I want to scrape all the content from the table, starting from page 1 to the very end. I am trying to use XPath but can't seem to get it to work.

Here is my code, any help welcome!

from selenium import webdriver
from selenium.webdriver.common.by import By

co = webdriver.ChromeOptions()
co.add_argument('--headless')
# co.add_argument('--ignore-certificate-errors')
# co.add_argument('--no-proxy-server')
# co.add_argument("--proxy-server='direct://'")
# co.add_argument("--proxy-bypass-list=*")
driver = webdriver.Chrome(executable_path="C:/Users/user/Desktop/IG Trading/chromedriver.exe", chrome_options=co)
driver.get('https://www.londonstockexchange.com/indices/ftse-aim-all-share/constituents/table?page=1')
stock_names = driver.find_elements(By.XPATH, '/html/body/app-root/app-handshake/div/app-page-content/app-filter-toggle/app-ftse-index-table/section/table')
print(stock_names)

# for stock_name in stock_names:
#     print(stock_name)
#     text = stock_name.text
#     print(text)
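
A note on the attempt above: find_elements returns a list of WebElement objects, so print(stock_names) prints object representations rather than the table text, and the table is rendered by JavaScript, so it may not exist yet when the lookup runs. Below is a minimal sketch of the same XPath approach with an explicit wait, assuming Selenium 4 (which locates the driver binary automatically):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

co = webdriver.ChromeOptions()
co.add_argument('--headless')
driver = webdriver.Chrome(options=co)  # Selenium 4 resolves chromedriver via Selenium Manager
driver.get('https://www.londonstockexchange.com/indices/ftse-aim-all-share/constituents/table?page=1')

# The table is built by JavaScript, so wait for it instead of querying immediately.
table = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH,
        '/html/body/app-root/app-handshake/div/app-page-content/app-filter-toggle/app-ftse-index-table/section/table'))
)
print(table.text)  # read .text from the element, not from the element list
driver.quit()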

This is one way you can obtain that information:

from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options as Firefox_Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time as t
import pandas as pd
from tqdm import tqdm

firefox_options = Firefox_Options()

# firefox_options.add_argument("--width=1500")
# firefox_options.add_argument("--height=500")
# firefox_options.headless = True

driverService = Service('chromedriver/geckodriver')  # path to your local geckodriver binary
browser = webdriver.Firefox(service=driverService, options=firefox_options)

big_df = pd.DataFrame()

browser.get('https://www.londonstockexchange.com/indices/ftse-aim-all-share/constituents/table')     
try:
    # accept the cookie banner if it appears
    WebDriverWait(browser, 3).until(EC.element_to_be_clickable((By.ID, "ccc-notify-accept"))).click()
    print('accepted cookies')
except Exception:
    print('no cookie button!')
t.sleep(2)

for i in tqdm(range(1, 40)):  # pages 1-39; the page count is hardcoded
    browser.get(f'https://www.londonstockexchange.com/indices/ftse-aim-all-share/constituents/table?page={i}')
    t.sleep(1)
    # wait for the constituents table, grab its HTML and let pandas parse it
    table_html = WebDriverWait(browser, 10).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, "table[class='full-width ftse-index-table-table']"))
    ).get_attribute('outerHTML')
    df = pd.read_html(table_html)[0]
    big_df = pd.concat([big_df, df], axis=0, ignore_index=True)

print(big_df)
big_df.to_csv('lse_companies.csv')
print('all done')
browser.quit()

This will display the big dataframe in the terminal once all pages are scraped, and also save it as a CSV file on disk (in the same folder you are running your script from). The setup is Firefox/geckodriver on Linux, but you can adapt it to your own setup: just observe the imports and the logic after defining the browser/driver (a rough Chrome adaptation is sketched below).

Selenium docs: https://www.selenium.dev/documentation/

TQDM: https://pypi.org/project/tqdm/
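
For completeness, here is a rough sketch of the same loop adapted to Chrome/chromedriver. It assumes Selenium 4 (the driver binary is resolved automatically) and a recent pandas, which expects read_html input wrapped in StringIO:

from io import StringIO

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = webdriver.ChromeOptions()
options.add_argument('--headless=new')  # current Chrome headless flag; plain --headless also works
browser = webdriver.Chrome(options=options)

big_df = pd.DataFrame()
for i in range(1, 40):  # page count hardcoded, as in the answer above
    browser.get(f'https://www.londonstockexchange.com/indices/ftse-aim-all-share/constituents/table?page={i}')
    # presence is enough here since only the HTML is read, nothing is clicked,
    # so the cookie banner handling from the answer is omitted
    table_html = WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, 'table.full-width.ftse-index-table-table'))
    ).get_attribute('outerHTML')
    # newer pandas deprecates passing literal HTML strings, hence StringIO
    df = pd.read_html(StringIO(table_html))[0]
    big_df = pd.concat([big_df, df], ignore_index=True)

print(big_df)
big_df.to_csv('lse_companies.csv')
browser.quit()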
