
Web-scraping data from a graph

I am working with lobbying data from opensecrets.org, in particular industry data. I want a time series of lobbying expenditures for each industry, going back to the 1990s.

I want to web-scrape the data automatically. The URLs where the data lives have the following format:

https://www.opensecrets.org/lobby/indusclient.php?id=H04&year=2019

These are pretty easy to embed in a loop. The problem is that the data I need is not in an easy format on the page: it sits inside a bar graph, and when I inspect the graph I cannot find the data in the HTML source. I am familiar with web scraping in Python when the data is in the HTML, but in this case I am not sure how to proceed.
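For context, the year-by-year URLs described above can be generated in a loop. The 1998 start year below is an assumption; adjust the range to whatever years opensecrets.org actually covers:

```python
# Build the per-industry, per-year URLs described above.
# id='H04' and the 1998 start year are taken as examples/assumptions.
base = 'https://www.opensecrets.org/lobby/indusclient.php?id={ind}&year={yr}'
urls = [base.format(ind='H04', yr=year) for year in range(1998, 2020)]

print(len(urls))   # 22 URLs, one per year
print(urls[-1])    # https://www.opensecrets.org/lobby/indusclient.php?id=H04&year=2019
```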

If there is an API, that is your best bet, as mentioned above. But the data can be parsed anyway, provided you hit the right URL with the right query parameters:
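The chart endpoint returns FusionCharts-style XML whose `<set>` elements carry the data points (year as `label`, dollars as `value`). A minimal offline sketch of just the parsing step; the sample XML below is made up for illustration:

```python
from bs4 import BeautifulSoup

# Made-up sample of the XML the chart endpoint serves; the real response
# has one <set> element per bar in the graph.
sample = '<chart><set label="2018" value="1.2"/><set label="2019" value="1.5"/></chart>'

soup = BeautifulSoup(sample, 'html.parser')
points = [(s['label'], s['value']) for s in soup.find_all('set')]
print(points)  # [('2018', '1.2'), ('2019', '1.5')]
```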

I've managed to iterate through the links for you to grab each table. I stored the results in a dictionary, with the firm name as the key and the table/data as the value. You can change that however you'd like; for example, store it all as JSON, or save each table as a CSV.

Code:

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.opensecrets.org/lobby/indusclient.php?id=H04&year=2019'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'}

data = requests.get(url, headers=headers)
soup = BeautifulSoup(data.text, 'html.parser')


links = soup.find_all('a', href=True)

root_url = 'https://www.opensecrets.org/lobby/include/IMG_client_year_comp.php?'
links_dict = {}

for each in links:
    if 'clientsum.php?' in each['href']:
        firms = each.text
        # Rebuild the link so it points at the chart-data endpoint for this client
        link = root_url + each['href'].split('?')[-1].split('&')[0].strip() + '&type=c'
        links_dict[firms] = link


all_tables = {}
n=1
tot = len(links_dict)
for firms, link in links_dict.items():

    print ('%s of %s  ---- %s' %(n, tot, firms))
    data = requests.get(link, headers=headers)
    soup = BeautifulSoup(data.text, 'html.parser')

    # The chart data arrives as XML: one <set label="year" value="total"> per bar
    graph = soup.find_all('set')

    rows = []
    for each in graph:
        year = each['label']
        total = each['value']
        rows.append([year, total])

    # Build the DataFrame in one go (DataFrame.append was removed in pandas 2.0)
    results = pd.DataFrame(rows, columns=['year', '$mil'])

    all_tables[firms] = results
    n+=1

**Output:**

Not going to print them all, as there are 347 tables, but just so you see the structure:

[image: screenshot of a sample per-firm DataFrame]
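As mentioned above, each table can also be saved to disk. A sketch of the CSV route, using a small made-up `all_tables` in place of the real 347-entry dictionary built by the scraping loop:

```python
import os
import pandas as pd

# Stand-in for the real all_tables dictionary built by the scraping loop above.
all_tables = {
    'Example Firm, Inc.': pd.DataFrame([['2018', '1.2'], ['2019', '1.5']],
                                       columns=['year', '$mil'])
}

os.makedirs('tables', exist_ok=True)
for firm, table in all_tables.items():
    # Replace characters that are unsafe in filenames
    safe_name = ''.join(c if c.isalnum() else '_' for c in firm)
    table.to_csv(os.path.join('tables', safe_name + '.csv'), index=False)
```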
