Extracting table contents from a website with an expandable table using Python and Selenium

Question

I want to extract the following numbers from this website: https://www.allabolag.se/5560566258/bokslut

I have already tried Selenium and managed to extract the numbers row by row:

4 806   1 709   486 
4 025   2 120   435 
526       15    2   
-38       12    2   
-48       7     2

But then I realized that these are only the three most recent years (2017, 2016 and 2015).

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import re

driver = webdriver.Chrome(executable_path="/Users/gabriele/Downloads/chromedriver")
driver.get("https://www.allabolag.se/5569640369/bokslut")

income_statement_raw = driver.find_element(By.ID, "bokslut")

income_statement_raw_box = income_statement_raw.find_elements(By.CLASS_NAME, "box")

# expected: 4806  1709  486  177

year_count_of_financial_data_raw = income_statement_raw_box[0].find_elements(By.XPATH, '//div[@class="table__container table__container--padding-bleed-x box__bleed-x--up-to-small"]//table[@class="table--background-separator company-table"]/tbody')

print(year_count_of_financial_data_raw[0].text)

driver.close()

I expected to get four numbers, because I can see them in the HTML (see image):

2017-12 2016-12 2015-12 2014-12
  4806    1709    486     177



But the result so far is:
2017-12 2016-12 2015-12 
4 806   1 709    486

Answer 1

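I parsed the page for you using BeautifulSoup. Selenium's .text returns only text that is visibly rendered, which is probably why the collapsed fourth column is missing from your output; parsing the full page source avoids that.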

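I'm not 100% sure which data you want to extract, so I focused on the "expected data" shown in your post. The data variable, however, contains every row of the extracted table.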

Remember to put the chromedriver for your platform in the script folder (uncomment the headless line to make the browser invisible).
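Instead of commenting and uncommenting a line, the headless switch can be made a parameter. The following is only a sketch: chrome_arguments and make_driver are hypothetical helper names, not part of the answer or of Selenium, and make_driver assumes Selenium and a matching chromedriver are installed.

```python
def chrome_arguments(headless=True):
    """Return the Chrome command-line arguments for the session."""
    args = ["--disable-gpu"]
    if headless:
        # Run Chrome without opening a visible window
        args.append("--headless")
    return args


def make_driver(headless=True):
    """Create a Chrome driver (hypothetical helper; needs selenium installed)."""
    # Imported lazily so chrome_arguments() can be used without Selenium
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in chrome_arguments(headless):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)


print(chrome_arguments(headless=False))
# ['--disable-gpu']
```

Calling make_driver(headless=False) would then open a visible browser for debugging. The script from the answer follows.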

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = "https://www.allabolag.se/5569640369/bokslut"
options = Options()
# options.add_argument('--headless')  # uncomment to hide the browser window
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=options)  # options= replaces the deprecated chrome_options=
driver.get(url)
time.sleep(3)  # crude wait to give the page time to finish loading
page = driver.page_source
driver.quit()

# Parse the full page source and take the first table
soup = BeautifulSoup(page, 'html.parser')
first_table = soup.select_one("table:nth-of-type(1)")

data = []
rows = first_table.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    # Trim each cell and remove the spaces used as thousands separators
    cols = [ele.text.strip().replace(" ", "") for ele in cols]
    data.append([ele for ele in cols if ele])  # skip empty cells

print(data[1])
#>>> ['4806', '1709', '486', '177']
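The values in data are still strings. If integers keyed by year are more convenient, a small helper using only the standard library can do the conversion. This is a sketch under assumptions: parse_amount and by_year are hypothetical names, the year labels are copied from the question rather than read from the page, and treating every whitespace character (including non-breaking spaces) as a thousands separator is a guess about how the site formats its numbers.

```python
def parse_amount(text):
    """Convert a string such as '4 806' or '-38' to an int."""
    # str.isspace() is also true for non-breaking spaces (U+00A0),
    # so both kinds of separator are dropped here.
    cleaned = "".join(ch for ch in text if not ch.isspace())
    return int(cleaned)


def by_year(years, values):
    """Pair each year label with the parsed value from the same column."""
    return {year: parse_amount(value) for year, value in zip(years, values)}


# Year labels as shown in the question; the row is the output of print(data[1])
years = ["2017-12", "2016-12", "2015-12", "2014-12"]
row = ["4806", "1709", "486", "177"]

print(by_year(years, row))
# {'2017-12': 4806, '2016-12': 1709, '2015-12': 486, '2014-12': 177}
```

Because parse_amount removes the separators itself, it also accepts the unstripped cell text, for example parse_amount("4 806").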

Solution 1
1 accepted, 2019-08-29 14:17:57