Extracting table contents from a website with expandable table using Python and Selenium

I want to extract the following numbers from this website: https://www.allabolag.se/5560566258/bokslut

I have tried using Selenium and I managed to extract the numbers by row:

4 806   1 709   486 
4 025   2 120   435 
526       15    2   
-38       12    2   
-48       7     2   

But then I realised these are only for the 3 latest years (2017, 2016, and 2015).

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import re

driver = webdriver.Chrome(executable_path="/Users/gabriele/Downloads/chromedriver")
driver.get("https://www.allabolag.se/5569640369/bokslut")

# The income statement section of the page
income_statement_raw = driver.find_element(By.ID, "bokslut")

# Each "box" inside the section holds one financial table
income_statement_raw_box = income_statement_raw.find_elements_by_class_name("box")

# expected: 4806  1709   486  177
# Grab the <tbody> of the company table inside the first box
year_count_of_financial_data_raw = income_statement_raw_box[0].find_elements_by_xpath('//div[@class="table__container table__container--padding-bleed-x box__bleed-x--up-to-small"]//table[@class="table--background-separator company-table"]/tbody')

print(year_count_of_financial_data_raw[0].text)

driver.close()

I expect to receive 4 numbers since I can see them in the HTML (see image):

2017-12 2016-12 2015-12 2014-12
  4806    1709    486     177



but the result so far is:
2017-12 2016-12 2015-12 
4 806   1 709    486    
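A possible cause (not verified against this page) is that the 2014-12 column is present in the DOM but hidden until the table is expanded: Selenium's .text returns only visible text, so a hidden cell would be skipped, while textContent includes hidden nodes as well. A rough check, reusing the tbody element found above (run before driver.close()):

tbody = year_count_of_financial_data_raw[0]
# .text returns only *visible* text, so a collapsed/hidden 2014-12 cell is skipped
print(tbody.text)
# textContent includes hidden nodes too, which would reveal the fourth column if it is merely hidden
print(tbody.get_attribute("textContent"))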

I've used BeautifulSoup to parse the webpage for you.

I am not 100% sure about the data you want to extract, so I focused on the "expected data" you showed in your post, but in the data variable you will find all the rows contained in the extracted table.

Please remember to put the chromedriver for your platform in the script folder (uncomment the headless line to make the browser invisible).

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = "https://www.allabolag.se/5569640369/bokslut"
options = Options()
#options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(chrome_options=options)
driver.get(url)
time.sleep(3)  # give the JavaScript-rendered table time to load
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, 'html.parser')
first_table = soup.select_one("table:nth-of-type(1)")  # first table on the page

data = []
rows = first_table.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    # strip whitespace and drop the thousands separator ("4 806" -> "4806")
    cols = [ele.text.strip().replace(" ", "") for ele in cols]
    data.append([ele for ele in cols if ele])

print(data[1])
#>>> ['4806', '1709', '486', '177']
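If you also need the year labels, note that the loop above only collects td cells; the years are most likely in th header cells of the same table (I have not verified the live markup, so treat this as a sketch and adjust the filtering if needed). A small extension could pair them with the value row:

# Assumption: the year labels (2017-12, 2016-12, ...) sit in <th> cells of the
# same table; if an extra label cell appears, filter it out before zipping.
years = [th.text.strip() for th in first_table.find_all('th') if th.text.strip()]
print(dict(zip(years, data[1])))
#>>> e.g. {'2017-12': '4806', '2016-12': '1709', '2015-12': '486', '2014-12': '177'}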
