
How to Build a Dynamic Web Scraper/Crawler: Python

Not really sure about the complexity of this question, but I figured I'd give it a shot.

How can I create a web crawler/scraper (not sure which I'd need) to get a CSV of all CEO pay-ratio data from https://www.bloomberg.com/graphics/ceo-pay-ratio/ ?

I'd like this information for further analysis; however, I am not sure how to retrieve it from a dynamic webpage. I have built web scrapers in the past, but only for simple websites and functions.

If you could point me to a good resource or post the code below, I will forever be in your debt.

Thanks in advance!

Since the website seems to load its content dynamically, I believe you will need Selenium, a library that automates browsers, and BeautifulSoup, a library for parsing the resulting webpages.

Since the part of the website you are interested in is a single page and you only need to retrieve the data, I would suggest first investigating how the data is loaded into the page. It is plausible that you could make a request directly to their server, with the same parameters as the page's script, and retrieve exactly the data you are interested in without driving a browser at all.

To make such a request, you could consider using yet another library called requests.

Note that scraping this website may be flagged as a violation of its terms of service; this particular site uses multiple techniques to deter script-based scraping.


If you inspect the webpage, you may observe that clicking the next button triggers no XHR request, so you can deduce that the content is loaded only once.

If you sort the network requests by size, you will find that all of the data is loaded from a single JSON file.
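To make the later snippets easier to follow, here is a miniature stand-in for that JSON file. The real file contains many more fields; the keys `companies`, `c` (company name), and `cpr` (CEO pay ratio) are the ones assumed by the code below:

```python
import json

# A miniature, hypothetical stand-in for the downloaded data.json.
# Only the keys the scraping code relies on are shown here.
sample = json.loads("""
{
  "companies": [
    {"c": "Aflac Inc", "cpr": 300},
    {"c": "AmerisourceBergen Corp", "cpr": 0}
  ]
}
""")

# Each entry in "companies" is a dict keyed by short field names
print(sample["companies"][0]["c"])    # company name
print(sample["companies"][0]["cpr"])  # pay ratio
```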


Using Python (note that you need to open the page in a browser just before running the script):

import requests

# Fetch the JSON endpoint that backs the pay-ratio page
data = requests.get("https://www.bloomberg.com/graphics/ceo-pay-ratio/live-data/ceo-pay-ratio/live/data.json").json()
for each in data['companies']:
    try:
        print("Company", each['c'], "=> CEO pay ratio", each['cpr'])
    except KeyError:
        print("Company", each['c'], "=> no CEO pay ratio!")

Which gives you:

Company Aflac Inc => CEO pay ratio 300
Company American Campus Communities Inc => CEO pay ratio 226
Company Aetna Inc => CEO pay ratio 235
Company Ameren Corp => CEO pay ratio 66
Company AmerisourceBergen Corp => CEO pay ratio 0
Company Advance Auto Parts Inc => CEO pay ratio 329
Company American International Group Inc => CEO pay ratio 697
Company Arthur J Gallagher & Co => CEO pay ratio 126
Company Arch Capital Group Ltd => CEO pay ratio 104
Company ACADIA Pharmaceuticals Inc => CEO pay ratio 54
[...]

It may be better to open the JSON in a web browser and save it locally than to request the website from the script.

After saving the JSON locally as data.json, you can read it with:

import json

# Read the locally saved copy of the JSON file
with open("data.json", "r") as f:
    data = json.load(f)

for each in data['companies']:
    try:
        print("Company", each['c'], "=> CEO pay ratio", each['cpr'])
    except KeyError:
        print("Company", each['c'], "=> no CEO pay ratio!")
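Since the original goal was a CSV for further analysis, the parsed records can be written out with the standard-library csv module. This is a sketch assuming the same `companies`/`c`/`cpr` keys as above, with an inline sample (including a hypothetical entry with no ratio) standing in for the downloaded file:

```python
import csv
import io

# Inline sample with the same shape as data["companies"] above;
# in practice you would iterate over the parsed data.json instead.
companies = [
    {"c": "Aflac Inc", "cpr": 300},
    {"c": "Aetna Inc", "cpr": 235},
    {"c": "Example Co"},  # hypothetical entry with no pay ratio
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["company", "ceo_pay_ratio"])
for each in companies:
    # .get() writes an empty cell instead of raising KeyError
    writer.writerow([each["c"], each.get("cpr", "")])

print(buf.getvalue())
```

Swapping `buf` for `open("ceo_pay_ratio.csv", "w", newline="")` writes the same rows to a file on disk.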
