Scraping table from Python Beautifulsoup

I tried to scrape a table from this website: https://stockrow.com/VRTX/financials/income/quarterly

I am using Python in Google Colab and I'd like to have the dates as columns (e.g. 2020-06-30). I used code like this:

import urllib.request
import bs4 as bs

source = urllib.request.urlopen('https://stockrow.com/VRTX/financials/income/quarterly').read()
soup = bs.BeautifulSoup(source,'lxml')
table = soup.find_all('table')

However, I cannot get the tables. I am fairly new to scraping, so I looked at other Stack Overflow pages but couldn't solve the problem. Can you please help me? That would be much appreciated.

You can use their API to load the data:

import requests
import pandas as pd


indicators_url = 'https://stockrow.com/api/indicators.json'
data_url = 'https://stockrow.com/api/companies/VRTX/financials.json?ticker=VRTX&dimension=Q&section=Income+Statement'

indicators = {i['id']: i for i in requests.get(indicators_url).json()}
all_data = []
for d in requests.get(data_url).json():
    d['id'] = indicators[d['id']]['name']
    all_data.append(d)

df = pd.DataFrame(all_data)
df.to_csv('data.csv')
print(df)

Prints:

                                     id    2020-06-30    2020-03-31    2019-12-31   2019-09-30   2019-06-30  ...   2011-12-31   2011-09-30    2011-06-30    2011-03-31    2010-12-31    2010-09-30
0          Consolidated Net Income/Loss   837270000.0   602753000.0   583234100.0   57518000.0  267427000.0  ...  188141000.0  228452000.0  -199318000.0  -176096000.0  -180392000.0  -208957000.0
1      EPS (Basic, from Continuous Ops)        3.2248        2.3199        2.2654       0.2239        1.044  ...       0.9374        1.109       -0.9751       -0.8703       -0.8966       -1.0402
2                     Net Profit Margin        0.5492        0.3978        0.4127       0.0606       0.2841  ...       0.2816       0.3354       -1.5213       -2.3906       -2.7531       -8.7816
3                          Gross Profit  1339965000.0  1352610000.0  1228253000.0  817914000.0  805553000.0  ...  533213000.0  620794000.0   105118000.0    70996000.0    62475000.0    20567000.0
4                  Income Tax Provision   -12500000.0    54781000.0    93716000.0   13148000.0   59711000.0  ...   22660000.0  -27842000.0    24448000.0           0.0           NaN           0.0
5                      Operating Income   718033000.0   720224100.0   551464400.0   99333000.0  269960000.0  ...  223901900.0  215707000.0  -165890000.0  -159899000.0  -166634000.0  -199588000.0
6                                  EBIT   718033000.0   720224100.0   551464700.0   99333000.0  269960000.0  ...  223901900.0  215707000.0  -165890000.0  -159899000.0  -166634000.0  -199588000.0
7         EPS (Diluted, from Cont. Ops)        3.1787        2.2874        2.2319       0.2208       1.0293  ...       1.0011       1.0415       -0.9751       -0.8703       -0.8966       -1.0402
8                                EBITDA   744730000.0   747045000.0   577720400.0  125180000.0  297658000.0  ...  233625900.0  223457000.0  -157181000.0  -151041000.0  -158429000.0  -192830000.0
9             EPS (Basic, Consolidated)        3.2248        2.3199        2.2654       0.2239        1.044  ...       0.9374        1.109       -0.9751       -0.8703       -0.8966       -1.0402
10                                  EBT   824770000.0   657534000.0   676950000.0   70666000.0  327138000.0  ...  210801000.0  200610000.0  -174870000.0  -176096000.0  -180392000.0  -208957000.0
11           Operating Cash Flow Margin        0.6812        0.5384        0.3156       0.3525       0.4927  ...       0.8941       0.0651       -1.8894       -2.5336        -2.535       -6.8918
12                           EBT margin         0.541         0.434         0.479       0.0744       0.3475  ...       0.3742       0.3043       -1.5283       -2.3906       -2.7531       -8.7816
13                          EBIT Margin         0.471        0.4754        0.3902       0.1046       0.2868  ...       0.3975       0.3272       -1.4498       -2.1707       -2.5431       -8.3878
14    Income from Continuous Operations   837270000.0   602753000.0   583234000.0   57518000.0  267427000.0  ...  188141000.0  228452000.0  -199318000.0  -176096000.0  -180392000.0  -208957000.0
15                         R&D Expenses   420928000.0   448528000.0   480011000.0  555948000.0  379091000.0  ...  186438000.0  189052000.0   173604000.0   158612000.0   168888000.0   170434000.0
16      Non-operating Interest Expenses    13871000.0    14136000.0    14249000.0   14548000.0   14837000.0  ...   11659000.0    7059000.0     6962000.0    12001000.0     7686000.0     3951000.0
17                        EBITDA Margin        0.4885        0.4931        0.4088       0.1318       0.3162  ...       0.4147        0.339       -1.3737       -2.0505       -2.4179       -8.1038
18         Non-operating Income/Expense   106737000.0   -62690000.0   125485000.0  -28667000.0   57178000.0  ...  -13101000.0  -15097000.0    -8980000.0   -16197000.0   -13758000.0    -9369000.0
19                          EPS (Basic)          3.22          2.32          2.26         0.22         1.04  ...         0.76         1.06         -0.85         -0.87          -0.9         -1.04
20                         Gross Margin         0.879        0.8927        0.8691       0.8611       0.8558  ...       0.9465       0.9417        0.9187        0.9638        0.9535        0.8643
21                              Revenue  1524485000.0  1515107000.0  1413265000.0  949828000.0  941293000.0  ...  563340000.0  659200000.0   114424000.0    73662000.0    65524000.0    23795000.0
22            Shares (Diluted, Average)   263403000.0   263515000.0   262108000.0  260473000.0  259822000.0  ...  217602000.0  219349000.0   204413000.0   202329000.0   201355000.0   200887000.0
23                      Cost of Revenue   184520000.0   162497000.0   185012000.0  131914000.0  135740000.0  ...   30127000.0   38406000.0     9306000.0     2666000.0     3049000.0     3228000.0
24                        SG&A Expenses   191804000.0   182258000.0   195277000.0  159674000.0  156502000.0  ...  121881000.0  110654000.0    96663000.0    71523000.0    62478000.0    48855000.0
25          EPS (Diluted, Consolidated)        3.1787        2.2874        2.2319       0.2208       1.0293  ...       1.0011       1.0415       -0.9751       -0.8703       -0.8966       -1.0402
26                       Revenue Growth        0.6196         0.765        0.6242       0.2107       0.2515  ...       7.5975      26.7033        2.6185        2.2842        0.9335       -0.0466
27             Shares (Basic, Weighted)   259637000.0   259815000.0   256728000.0  256946000.0  256154000.0  ...  204891000.0  206002000.0   204413000.0   202329000.0   200402000.0   200887000.0
28                     Income after Tax   837270000.0   602753000.0   583234000.0   57518000.0  267427000.0  ...  188141000.0  228452000.0  -199318000.0  -176096000.0  -180392000.0  -208957000.0
29                        EPS (Diluted)          3.18          2.29          2.23         0.22         1.03  ...         0.74         1.02         -0.85         -0.87          -0.9         -1.04
30                    Net Income Common   837270000.0   602753000.0   583234100.0   57518000.0  267427000.0  ...  158629000.0  221110000.0  -174069000.0  -176096000.0  -180392000.0  -208957000.0
31           Shares (Diluted, Weighted)   263403000.0   263515000.0   260673000.0  260473000.0  259822000.0  ...  208807000.0  219349000.0   204413000.0   202329000.0   200402000.0   200887000.0
32             Non-Controlling Interest           NaN           NaN           NaN          NaN          NaN  ...   29512000.0    7342000.0   -25249000.0           0.0           NaN           0.0
33                Dividends (Preferred)           NaN           NaN           NaN          NaN          NaN  ...          NaN          NaN           NaN           NaN           NaN           NaN
34   EPS (Basic, from Discontinued Ops)           NaN           NaN           NaN          NaN          NaN  ...          NaN          NaN           NaN           NaN           NaN           NaN
35        EPS (Diluted, from Disc. Ops)           NaN           NaN           NaN          NaN          NaN  ...          NaN          NaN           NaN           NaN           NaN           NaN
36  Income from Discontinued Operations           NaN           NaN           NaN          NaN          NaN  ...          NaN          NaN           NaN           NaN           NaN           NaN

[37 rows x 41 columns]

The script also saves the result as data.csv.
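
If you want the dates to behave as real dates rather than plain strings, a minimal sketch along these lines may help. It assumes the records have the shape shown in the printed output (an id field plus one key per date); the two sample rows are copied from that output, and the names records and by_date are only illustrative.

```python
import pandas as pd

# Two sample records copied from the printed output above; the real API
# response is assumed to have the same shape (an "id" plus one key per date).
records = [
    {"id": "Revenue", "2020-06-30": 1524485000.0, "2020-03-31": 1515107000.0},
    {"id": "Gross Profit", "2020-06-30": 1339965000.0, "2020-03-31": 1352610000.0},
]

# Use the indicator name as the row index so only date columns remain.
df = pd.DataFrame(records).set_index("id")

# Turn the date strings into timestamps and order the columns chronologically.
df.columns = pd.to_datetime(df.columns)
df = df.sort_index(axis=1)

# Transposed view: one row per date, one column per indicator.
by_date = df.T

print(df)
print(by_date)
```

Keeping id as the index also means the CSV written by to_csv starts with the indicator names instead of a numeric row counter.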

Or download their XLSX from that page:

url = 'https://stockrow.com/api/companies/VRTX/financials.xlsx?dimension=Q&section=Income%20Statement&sort=desc'

df = pd.read_excel(url)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
print(df)

The first problem is that the table is loaded via JavaScript, so BeautifulSoup does not find it: it is not yet in the HTML at the moment of parsing. To solve this you'll need to use Selenium.

The second problem is that there is no table tag in the HTML; the page uses a grid layout instead.
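
Both problems can be reproduced offline with a small sketch. The HTML string below is hypothetical (it is not the real page source); it only imitates a page whose content is filled in by a script at runtime, which is roughly what urllib receives from such a site.

```python
from bs4 import BeautifulSoup

# Hypothetical static HTML: an empty container plus a script that would
# render the grid in a real browser. This is NOT the actual page source.
static_html = """
<html><body>
  <div id="root"></div>
  <script>/* the grid is rendered here by JavaScript at runtime */</script>
</body></html>
"""

soup = BeautifulSoup(static_html, "html.parser")

# No <table> tags exist in the static markup.
tables = soup.find_all("table")

# The container is present, but it has no child elements yet.
root = soup.find("div", id="root")
root_children = root.find_all(True)

print(tables)
print(root_children)
```

BeautifulSoup only parses the text it is given and never runs scripts, so both results stay empty until a browser (here, Selenium) has rendered the page.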

Since you're using Google Colab, you'll need to install the Selenium web driver there (code taken from this answer):

!pip install selenium
!apt-get update # to update ubuntu to correctly run apt install
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
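# chromedriver was copied to /usr/bin, so it is found on PATH;
# options= replaces the deprecated chrome_options= keyword
wd = webdriver.Chrome(options=chrome_options)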

After that you can load the page and parse it:

from bs4 import BeautifulSoup
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

# load page via selenium
wd.get("https://stockrow.com/VRTX/financials/income/quarterly")

# wait up to 5 seconds for the element with class mainGrid to be present
grid = WebDriverWait(wd, 5).until(EC.presence_of_element_located((By.CLASS_NAME, 'mainGrid')))

# parse content of the grid
soup = BeautifulSoup(grid.get_attribute('innerHTML'), 'lxml')

# access grid cells, your logic should be here
for tag in soup.find_all('div', {'class': 'financials-value'}):
  print(tag)
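
What goes in place of print(tag) depends on how the grid is laid out, which I have not checked in detail. As a rough sketch only: assuming the cells come back row by row and you know the number of date columns, the values could be chunked into rows like this. The markup, the numbers, the row labels and n_cols below are all hypothetical; only the class name financials-value comes from the code above.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical grid markup; only the class name financials-value is taken
# from the snippet above. Values and structure are made up for illustration.
sample_html = """
<div class="mainGrid">
  <div class="financials-value">1,524.49</div>
  <div class="financials-value">1,515.11</div>
  <div class="financials-value">1,339.97</div>
  <div class="financials-value">1,352.61</div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Extract the cell text and convert "1,524.49" style strings to floats.
values = [
    float(tag.get_text(strip=True).replace(",", ""))
    for tag in soup.find_all("div", {"class": "financials-value"})
]

# Assumption: cells are ordered row by row, with n_cols date columns per row.
n_cols = 2
rows = [values[i:i + n_cols] for i in range(0, len(values), n_cols)]

# Row and column labels are placeholders; on the real page they would have
# to be scraped from the corresponding header cells.
df = pd.DataFrame(rows, index=["Revenue", "Gross Profit"],
                  columns=["2020-06-30", "2020-03-31"])
print(df)
```

If the real grid orders its cells differently, or mixes empty cells in, the chunking step would need to be adapted.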
