
Get data from table in Beautiful Soup

I am trying to retrieve the 'Shares Outstanding' of a stock via this page:

https://www.sec.gov/cgi-bin/viewer?action=view&cik=320193&accession_number=0000320193-20-000052&xbrl_type=v#

(Click on 'Financial Statements' - 'Condensed Consolidated Balance Sheets (Unaudited) (Parenthetical)')

The data is in the bottom row of the table. I am using Beautiful Soup, but I am having trouble retrieving the share count.

The code I am using:

import requests
from bs4 import BeautifulSoup

URL = 'https://www.sec.gov/cgi-bin/viewer?action=view&cik=320193&accession_number=0000320193-20-000052&xbrl_type=v#'
page = requests.get(URL)

soup = BeautifulSoup(page.content, 'html.parser')

rows = soup.find_all('tr')

for row in rows:
    document = row.find('a', string='Common stock, shares outstanding (in shares)')
    shares = row.find('td', class_='nump')
    if None in (document, shares):
        continue
    print(document)
    print(shares)

This returns nothing, but the desired output is 4,323,987,000.

Can someone help me retrieve this data?

Thanks!

That's a JS-rendered page. Use Selenium:

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from time import sleep
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/cgi-bin/viewer?action=view&cik=320193&accession_number=0000320193-20-000052&xbrl_type=v#'

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.maximize_window()
driver.get(url)
sleep(10)  # <--- waits 10 seconds so the page can get rendered
# print(driver.page_source)  # <--- dumps the rendered source if you want to inspect it

soup = BeautifulSoup(driver.page_source, 'html.parser')
rows = soup.find_all('tr')

for row in rows:
    shares = row.find('td', class_='nump')
    if shares:
        print(shares)

Output:

<td class="nump">4,334,335<span></span>
</td>
<td class="nump">4,334,335<span></span>
</td>


Better:

shares = soup.find('td', class_='nump')
if shares:
    print(shares.text.strip())

4,334,335

Ah, the joys of scraping EDGAR filings :(...

You're not getting your expected output because you're looking in the wrong place. The url you have is an iXBRL viewer. The data comes from here:

url = 'https://www.sec.gov/Archives/edgar/data/320193/000032019320000052/R1.htm'

You can either find that url by looking at the network tab in the developer tools, or you can simply translate the viewer url into this one: for example, the 320193 figure is the cik number, etc.
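As a sketch of that translation (the helper name is made up; it assumes the report page is R1.htm and that the Archives directory is the CIK followed by the accession number with its dashes stripped):

```python
# Hypothetical helper: rebuild the Archives URL from the cik and
# accession_number query parameters of the viewer URL.
def archives_url(cik: str, accession: str, report: str = "R1.htm") -> str:
    # Archives path = CIK (no leading zeros) + accession number without dashes
    return (
        "https://www.sec.gov/Archives/edgar/data/"
        f"{int(cik)}/{accession.replace('-', '')}/{report}"
    )

print(archives_url("320193", "0000320193-20-000052"))
# -> https://www.sec.gov/Archives/edgar/data/320193/000032019320000052/R1.htm
```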

Once you figure that out, the rest is simple:

import requests
from bs4 import BeautifulSoup as bs

req = requests.get(url)
soup = bs(req.text, 'lxml')
soup.select_one('.nump').text.strip()

Output:

'4,334,335'
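If you need that value as a number rather than a string, stripping the thousands separators is enough (whether the figure is shares or thousands of shares depends on the units stated in the filing's table header, which isn't shown here):

```python
text = "4,334,335"
# remove the thousands separators before converting
shares = int(text.replace(",", ""))
print(shares)  # 4334335
```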

Edit:

To search by "Shares Outstanding", try:

targets = soup.select('tr.ro')
for target in targets:
    targ = target.select('td.pl')
    for t in targ:
        if "Shares Outstanding" in t.text:
            print(target.select_one('td.nump').text.strip())
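The loop above can be checked offline against a minimal snippet shaped like the R1.htm table (the class names tr.ro, td.pl and td.nump come from the answer; the row contents here are invented):

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the R1.htm markup: structure matches, values don't.
html = """
<table>
  <tr class="ro"><td class="pl">Common Stock, Par Value</td><td class="nump">0.00001</td></tr>
  <tr class="ro"><td class="pl">Common Stock, Shares Outstanding</td><td class="nump">4,334,335</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
for target in soup.select("tr.ro"):
    for t in target.select("td.pl"):
        if "Shares Outstanding" in t.text:
            # only the second row's label matches, so only its nump cell prints
            print(target.select_one("td.nump").text.strip())
```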

And might as well throw this one in: another, different way to do it is to use xpath instead, using the lxml library:

import lxml.html as lh

doc = lh.fromstring(req.text)
doc.xpath('//tr[@class="ro"]//td[@class="pl "][contains(.//text(),"Shares Outstanding")]/following-sibling::td[@class="nump"]/text()')[0]
