Get data from table in Beautiful Soup

I am trying to retrieve a stock's "shares outstanding" from this page:

https://www.sec.gov/cgi-bin/viewer?action=view&cik=320193&accession_number=0000320193-20-000052&xbrl_type=v#

(Click "Financial Statements", then "Condensed Consolidated Balance Sheets (Unaudited) (Parenthetical)".)

The data is at the bottom of the table on the left. I am using Beautiful Soup, but I am having trouble retrieving the share count.

The code I am using:

import requests
from bs4 import BeautifulSoup

URL = 'https://www.sec.gov/cgi-bin/viewer?action=view&cik=320193&accession_number=0000320193-20-000052&xbrl_type=v#'
page = requests.get(URL)

soup = BeautifulSoup(page.content, 'html.parser')

rows = soup.find_all('tr')

for row in rows:
    document = row.find('a', string='Common stock, shares outstanding (in shares)')
    shares = row.find('td', class_='nump')
    if None in (document, shares):
        continue
    print(document)
    print(shares)

This returns nothing, but the desired output is 4,323,987,000.

Can someone help me retrieve this data?

Thanks!

That's a JS-rendered page. Use Selenium:

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import time
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/cgi-bin/viewer?action=view&cik=320193&accession_number=0000320193-20-000052&xbrl_type=v#'

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.maximize_window()
driver.get(url)
time.sleep(10)  # <--- wait 10 seconds so the page can render
# print(driver.page_source)  # <--- this gives you the rendered source
soup = BeautifulSoup(driver.page_source, 'html.parser')
rows = soup.find_all('tr')

for row in rows:
    shares = row.find('td', class_='nump')
    if shares:
        print(shares)

<td class="nump">4,334,335<span></span>
</td>
<td class="nump">4,334,335<span></span>
</td>


Better to use:

shares = soup.find('td', class_='nump')
if shares:
    print(shares.text.strip())

4,334,335

Ah, the joys of scraping EDGAR filings :(...

You are not getting the output you expect because you are looking in the wrong place. The URL you have is an iXBRL viewer. The data comes from here:

url = 'https://www.sec.gov/Archives/edgar/data/320193/000032019320000052/R1.htm'

You can find that URL in the network tab of your browser's developer tools, or you can simply translate the viewer URL into it using the numbers it already contains (320193 and so on):
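As a concrete sketch of that translation (my addition, not part of the original answer; the `R1.htm` report name is an assumption and can differ for other filings), the Archives URL can be rebuilt from the `cik` and `accession_number` parameters of the viewer URL:

```python
# Identifiers taken from the viewer URL's query string
cik = '320193'
accession_number = '0000320193-20-000052'

# The Archives path uses the cik without leading zeros and the
# accession number without dashes; "R1.htm" (the first rendered
# report) is an assumption that may vary between filings.
archive_url = (
    'https://www.sec.gov/Archives/edgar/data/'
    f'{int(cik)}/{accession_number.replace("-", "")}/R1.htm'
)
print(archive_url)
```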

Once you figure that out, the rest is simple:

import requests
from bs4 import BeautifulSoup as bs

req = requests.get(url)
soup = bs(req.text, 'lxml')
soup.select_one('.nump').text.strip()

Output:

'4,334,335'
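The scraped value is still a string; if you need a number, a small helper (my addition, not part of the original answer) strips the thousands separators:

```python
def parse_share_count(text):
    """Convert a thousands-separated figure like '4,334,335' to an int."""
    return int(text.replace(',', ''))

print(parse_share_count('4,334,335'))  # → 4334335
```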

Edit:

To search by "Shares Outstanding", try:

targets = soup.select('tr.ro')
for target in targets:
    targ = target.select('td.pl')
    for t in targ:
        if "Shares Outstanding" in t.text:
            print(target.select_one('td.nump').text.strip())

Might as well throw this in: a different approach is to use XPath instead, with the lxml library:

import requests
import lxml.html as lh

req = requests.get(url)
doc = lh.fromstring(req.text)
doc.xpath('//tr[@class="ro"]//td[@class="pl "][contains(.//text(),"Shares Outstanding")]/following-sibling::td[@class="nump"]/text()')[0]
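To see what that XPath is matching, here is a self-contained sketch against a made-up HTML fragment that mimics the row structure of R1.htm (the fragment is illustrative, not the real filing):

```python
import lxml.html as lh

# Hypothetical fragment mimicking the row structure of R1.htm;
# note the trailing space in class="pl ", which the XPath relies on.
html = '''
<table>
  <tr class="ro">
    <td class="pl ">Common Stock, Shares Outstanding</td>
    <td class="nump">4,334,335</td>
  </tr>
</table>
'''

doc = lh.fromstring(html)
value = doc.xpath(
    '//tr[@class="ro"]//td[@class="pl "]'
    '[contains(.//text(),"Shares Outstanding")]'
    '/following-sibling::td[@class="nump"]/text()'
)[0]
print(value)  # → 4,334,335
```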

