
Web-scraping data from a graph

I am working with lobbying data from opensecrets.org, in particular industry data. I want a time series of lobbying expenditures for each industry, going back to the 1990s.

I want to web-scrape the data automatically. The URLs where the data lives have the following format:

https://www.opensecrets.org/lobby/indusclient.php?id=H04&year=2019

These are pretty easy to embed in a loop. The problem is that the data I need is not in an easy format on the page: it sits inside a bar graph, and when I inspect the graph I cannot find the data in the HTML source. I am familiar with web scraping in Python when the data is in the HTML, but in this case I am not sure how to proceed.
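For context, the year-by-year URLs described above can be generated in a loop. The 1998 start year below is an assumption; adjust the range to whatever years opensecrets.org actually covers:

```python
# Build the per-industry, per-year URLs described above.
# id='H04' and the 1998 start year are taken as examples/assumptions.
base = 'https://www.opensecrets.org/lobby/indusclient.php?id={ind}&year={yr}'
urls = [base.format(ind='H04', yr=year) for year in range(1998, 2020)]

print(len(urls))   # 22 URLs, one per year
print(urls[-1])    # https://www.opensecrets.org/lobby/indusclient.php?id=H04&year=2019
```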

If there is an API, that is your best bet, as mentioned above. But the data can be parsed anyway, provided you hit the right URL with the right query parameters:
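The chart endpoint returns FusionCharts-style XML whose `<set>` elements carry the data points (year as `label`, dollars as `value`). A minimal offline sketch of just the parsing step; the sample XML below is made up for illustration:

```python
from bs4 import BeautifulSoup

# Made-up sample of the XML the chart endpoint serves; the real response
# has one <set> element per bar in the graph.
sample = '<chart><set label="2018" value="1.2"/><set label="2019" value="1.5"/></chart>'

soup = BeautifulSoup(sample, 'html.parser')
points = [(s['label'], s['value']) for s in soup.find_all('set')]
print(points)  # [('2018', '1.2'), ('2019', '1.5')]
```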

I've managed to iterate through the links for you to grab each table. I stored the results in a dictionary, with the firm name as the key and the table/data as the value. You can change that however you'd like; for example, store it all as JSON, or save each table as a CSV.

Code:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.opensecrets.org/lobby/indusclient.php?id=H04&year=2019'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'}

data = requests.get(url, headers=headers)
soup = BeautifulSoup(data.text, 'html.parser')


links = soup.find_all('a', href=True)

root_url = 'https://www.opensecrets.org/lobby/include/IMG_client_year_comp.php?'
links_dict = {}

for each in links:
    if 'clientsum.php?' in each['href']:
        firms = each.text
        # Rebuild the link so it points at the chart-data endpoint for this client
        link = root_url + each['href'].split('?')[-1].split('&')[0].strip() + '&type=c'
        links_dict[firms] = link


all_tables = {}
n=1
tot = len(links_dict)
for firms, link in links_dict.items():

    print ('%s of %s  ---- %s' %(n, tot, firms))
    data = requests.get(link, headers=headers)
    soup = BeautifulSoup(data.text, 'html.parser')

    # The chart data arrives as XML: one <set label="year" value="total"> per bar
    graph = soup.find_all('set')

    rows = []
    for each in graph:
        year = each['label']
        total = each['value']
        rows.append([year, total])

    # Build the DataFrame in one go (DataFrame.append was removed in pandas 2.0)
    results = pd.DataFrame(rows, columns=['year', '$mil'])

    all_tables[firms] = results
    n+=1

**Output:**

Not going to print them all, as there are 347 tables, but just so you see the structure:

[image: screenshot of a sample per-firm DataFrame]
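As mentioned above, each table can also be saved to disk. A sketch of the CSV route, using a small made-up `all_tables` in place of the real 347-entry dictionary built by the scraping loop:

```python
import os
import pandas as pd

# Stand-in for the real all_tables dictionary built by the scraping loop above.
all_tables = {
    'Example Firm, Inc.': pd.DataFrame([['2018', '1.2'], ['2019', '1.5']],
                                       columns=['year', '$mil'])
}

os.makedirs('tables', exist_ok=True)
for firm, table in all_tables.items():
    # Replace characters that are unsafe in filenames
    safe_name = ''.join(c if c.isalnum() else '_' for c in firm)
    table.to_csv(os.path.join('tables', safe_name + '.csv'), index=False)
```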
