
Python/Selenium web scraping JS table without using JSON data

I would like to scrape the table at:

https://www2.sgx.com/securities/annual-reports-financial-statements

I understand this can be done by studying the request headers and using the following API call: https://api.sgx.com/financialreports/v1.0?pagestart=3&pagesize=250&params=id,companyName,documentDate,securityName,title,url — but I would like to know whether all of the data can be pulled from the table without going that route, since otherwise I need to parse 16 JSON files.
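For reference, a single request to that endpoint is enough to confirm the shape of the payload before committing to either approach. A minimal sketch — the 'data' key is taken from the answer below, and the field names follow the params in the URL above:

import requests

# One page of the endpoint the site itself calls (visible in the browser's
# network tab); pagestart/pagesize drive the paging.
url = ('https://api.sgx.com/financialreports/v1.0'
       '?pagestart=1&pagesize=250'
       '&params=id%2CcompanyName%2CdocumentDate%2CsecurityName%2Ctitle%2Curl')

r = requests.get(url)
r.raise_for_status()
records = r.json()['data']

print(len(records), 'records on this page')
print(records[0])  # fields: id, companyName, documentDate, securityName, title, url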

When scraping with Selenium, I can only reach the end of the currently visible table (clicking "Clear all" on the left makes the table much larger, and that larger table contains all the data I need).

Any ideas welcome!

Edit: here is the code, which returns only 144 cells out of the thousands in the table:

from time import sleep  # to wait for stuff to finish.
from selenium import webdriver  # to interact with our site.
from selenium.common.exceptions import WebDriverException  # raised if the url is wrong
from webdriver_manager import chrome  # to install and find the chromedriver executable


BASE_URL = 'https://www2.sgx.com/securities/annual-reports-financial-statements'
driver = webdriver.Chrome(executable_path=chrome.ChromeDriverManager().install())
driver.maximize_window()

try:
    driver.get(BASE_URL)
except WebDriverException:
    print("Url given is not working, please try again.")
    exit()

# clicking away pop-up
sleep(5)
header = driver.find_element_by_id("website-header")
driver.execute_script("arguments[0].click();", header)

# clicking the clear all button, to clear the calendar
sleep(2)
clear_field = driver.find_element_by_xpath('/html/body/div[1]/main/div[1]/article/template-base/div/div/sgx-widgets-wrapper/widget-filter-listing/widget-filter-listing-financial-reports/section[2]/div[1]/sgx-filter/sgx-form/div[2]/span[2]')
clear_field.click()

# clicking to select only Annual Reports
sleep(2)
driver.find_element_by_xpath("/html/body/div[1]/main/div[1]/article/template-base/div/div/sgx-widgets-wrapper/widget-filter-listing/widget-filter-listing-financial-reports/section[2]/div[1]/sgx-filter/sgx-form/div[1]/div[1]/sgx-input-select/label/span[2]/input").click()
sleep(1)
driver.find_element_by_xpath("//span[text()='Annual Report']").click()

cells = driver.find_elements_by_class_name("sgx-table-cell")
print(len(cells))
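The count most likely stops at 144 because the sgx-table widget virtualizes its rows, only rendering the cells currently in view. One Selenium-side workaround is to scroll the table's own container in steps and collect cells as they are rendered. A rough sketch continuing from the script above — the container selector is a guess, so inspect the live DOM for the element that actually owns the scrollbar:

# NOTE: '.sgx-table-table-container' is an assumed selector -- replace it
# with whatever element actually scrolls in the rendered page.
container = driver.find_element_by_css_selector('.sgx-table-table-container')

seen = set()
prev_top = -1
while True:
    # collect whatever cells are rendered right now (identical texts
    # collapse in the set; key on the row's document id in real use)
    for cell in driver.find_elements_by_class_name('sgx-table-cell'):
        seen.add(cell.text)
    # scroll the container down by one viewport and let new rows render
    driver.execute_script('arguments[0].scrollTop += arguments[0].clientHeight;', container)
    sleep(1)
    top = driver.execute_script('return arguments[0].scrollTop;', container)
    if top == prev_top:  # scrollTop stopped moving: we hit the bottom
        break
    prev_top = top

print(len(seen), 'distinct cell values collected')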

I know you asked not to use the API, but I think using it is the cleaner approach.

(The output is 3709 documents.)

import requests

# Same endpoint the page calls in the background; {} takes the page number
# and each page returns up to 250 documents.
URL_TEMPLATE = 'https://api.sgx.com/financialreports/v1.0?pagestart={}&pagesize=250&params=id%2CcompanyName%2CdocumentDate%2CsecurityName%2Ctitle%2Curl'

NUM_OF_PAGES = 16
data = []
for page_num in range(1, NUM_OF_PAGES):
    r = requests.get(URL_TEMPLATE.format(page_num))
    if r.status_code == 200:  # only keep pages that came back OK
        data.extend(r.json()['data'])
print('we have {} documents'.format(len(data)))
for doc in data:
    print(doc)
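If a flat file is the end goal, the collected dicts drop straight into pandas; the column names follow the params in the request URL. A short sketch continuing from the loop above, assuming pandas is installed:

import pandas as pd

df = pd.DataFrame(data)  # columns: id, companyName, documentDate, securityName, title, url
df.to_csv('sgx_financial_reports.csv', index=False)
print(df.head())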

