Extracting table contents from a website with expandable table using Python and Selenium
I want to extract the following numbers from this website: https://www.allabolag.se/5560566258/bokslut
I have already tried Selenium and managed to extract the numbers row by row:
4 806 1 709 486
4 025 2 120 435
526 15 2
-38 12 2
-48 7 2
But then I realized that these cover only the three most recent years (2017, 2016 and 2015).
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import re
driver = webdriver.Chrome(executable_path="/Users/gabriele/Downloads/chromedriver")
driver.get("https://www.allabolag.se/5569640369/bokslut")
income_statement_raw = driver.find_element(By.ID, "bokslut")
income_statement_raw_box = income_statement_raw.find_elements_by_class_name("box")
#expected 4806 1709 486 177
year_count_of_financial_data_raw = income_statement_raw_box[0].find_elements_by_xpath('//div[@class="table__container table__container--padding-bleed-x box__bleed-x--up-to-small"]//table[@class="table--background-separator company-table"]/tbody')
print(year_count_of_financial_data_raw[0].text)
driver.close()
I expected to get 4 numbers, because I can see them in the HTML (see image):
2017-12 2016-12 2015-12 2014-12
4806 1709 486 177
But the result so far is:
2017-12 2016-12 2015-12
4 806 1 709 486
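I parsed the page for you with BeautifulSoup.
I'm not 100% sure which data you want to extract, so I focused on the "expected data" shown in your post, but the data variable contains every row of the extracted table.
Remember to put the chromedriver build for your platform in the script's folder (uncomment the headless line to keep the browser window hidden).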
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
url = "https://www.allabolag.se/5569640369/bokslut"
options = Options()
#options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=options)  # 'chrome_options' is the deprecated name of this keyword
driver.get(url)
time.sleep(3)
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, 'html.parser')
first_table = soup.select_one("table:nth-of-type(1)")
data = []
rows = first_table.find_all('tr')
for row in rows:
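    # collect the cell texts of this row, dropping the spaces used as thousands separators
    cols = row.find_all('td')
    cols = [ele.text.strip().replace(" ", "") for ele in cols]
    # keep only non-empty cells
    data.append([ele for ele in cols if ele])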
print(data[1])
#>>> ['4806', '1709', '486', '177']
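If you need the values as numbers rather than strings, something like the following could be applied to a row of data. This is only a minimal sketch: the helper name to_int and the variables periods and by_period are made up for illustration, the period labels are copied by hand from the header row shown in the question, and it assumes whitespace (including non-breaking spaces) appears inside a cell only as a thousands separator.

```python
import re


def to_int(cell: str) -> int:
    """Convert a cell such as '4806', '4 806' or '-38' to an int.

    Hypothetical helper: assumes any whitespace inside the cell
    (regular or non-breaking) is only a thousands separator.
    """
    return int(re.sub(r"\s+", "", cell))


# Period labels copied by hand from the question's header row (assumption:
# they line up one-to-one with the cells of the extracted row).
periods = ["2017-12", "2016-12", "2015-12", "2014-12"]

# Same values as data[1] in the script above.
row = ["4806", "1709", "486", "177"]

# Pair each period with its numeric value.
by_period = dict(zip(periods, (to_int(cell) for cell in row)))
print(by_period)
```

Converting cell by cell matters here: in a whole-row string such as "4 806 1 709 486" the space separates both thousands and columns, so the row text alone cannot be split reliably.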