
Scraping data from a dynamic graph using python + beautifulSoup4

I need to implement a data scraping task and extract data from a dynamic graph. The graph updates over time, similar to what you would see if you looked at a graph of a company's stock price. I am using the requests and beautifulsoup4 libraries in Python, but I have only figured out how to scrape text and link data. I can't seem to figure out how to get the values of the graph into a CSV file.

The graph in question can be found at http://www.apptrace.com/app/instagram/id389801252/ranks/topfreeapplications/36

The data behind the graph can be obtained easily if you have the correct URL. You can find this address rather easily with, e.g., the "developer tools" in Firefox (check the "Network" tab for the XHR requests).

You'll see calls are being made to, e.g.:

src = 'http://www.apptrace.com/api/app/389801252/rankings/country/?country=CAN&start_date=2014-08-08&end_date=&device=iphone&list_type=normal&chart_subtype=iphone'

If you call it, you'll be served a JSON reply, which you can easily load into Python:

import json
from urllib.request import urlopen  # Python 3; in Python 2 this was urllib.urlopen

>>> data = urlopen(src).read()
>>> reply = json.loads(data)
>>> ranks = reply['rankings'][0]['ranks']
>>> res = {'date': [], 'rank': []}
>>> for d in ranks:
...     res['date'].append(d['date'])
...     res['rank'].append(d['rank'])
... 
>>> res['date'][:3]
['2014-08-08', '2014-08-09', '2014-08-10']
>>> res['rank'][:3]
[10, 14, 13]

You can then store the data in a CSV file using Python's built-in csv module.
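For example, a minimal sketch of that last step, writing the `res` dictionary built above to a file (the sample values here are the ones shown in the session; the filename is just an example):

```python
import csv

# Example data in the same shape as the `res` dictionary built above
res = {'date': ['2014-08-08', '2014-08-09', '2014-08-10'],
       'rank': [10, 14, 13]}

with open('ranks.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['date', 'rank'])                 # header row
    writer.writerows(zip(res['date'], res['rank']))   # one row per data point
```

This produces a two-column CSV with one row per date/rank pair.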

@Oliver W. already provided a good answer, but using requests avoids having to note the network call and is overall a much nicer package than urllib.

If you want your code to be a bit more flexible, you can write a function that takes the country name and the start and end dates.

import requests
import pandas as pd
import json

def load_data(country='', start_date='2014-08-09', end_date='2014-11-1'):
    base = "http://www.apptrace.com/api/app/389801252/rankings/country/"
    extra = "?country={0}&start_date={1}&end_date={2}&device=iphone&list_type=normal&chart_subtype=iphone"
    addr = base + extra.format(country, start_date, end_date)

    page = requests.get(addr)
    json_data = page.json()  # parse the JSON body of the response
    ranks = json_data['rankings'][0]['ranks']
    ranks = json.dumps(ranks)  # re-serialize the list so pandas can parse it
    df = pd.read_json(ranks, orient='records')
    return df

Change things in the webpage to see what other values you can pass for country ('CAN' for Canada, for example). The empty string is for the USA.

The resulting df looks like this:

    date        rank
0   2014-08-09  10
1   2014-08-10  10
2   2014-08-11  9
3   2014-08-12  8
4   2014-08-13  8
5   2014-08-14  7
6   2014-08-15  6
7   2014-08-16  8

With the pandas dataframe in hand, you can export it to CSV, or combine many dataframes before you export:

df = load_data()
df.to_csv("file_name.csv")
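As a sketch of the "combine many dataframes" idea: if you fetched several countries with `load_data`, you could tag each frame and stack them with `pandas.concat` before exporting. The frames and filename below are made-up examples standing in for real `load_data` results:

```python
import pandas as pd

# Hypothetical frames in the shape load_data() returns, one per country
usa = pd.DataFrame({'date': ['2014-08-09', '2014-08-10'], 'rank': [10, 10]})
can = pd.DataFrame({'date': ['2014-08-09', '2014-08-10'], 'rank': [12, 11]})

# Tag each frame with its country, stack them, and export once
usa['country'] = 'USA'
can['country'] = 'CAN'
combined = pd.concat([usa, can], ignore_index=True)
combined.to_csv("all_countries.csv", index=False)
```

The extra `country` column keeps the rows distinguishable after they are stacked into one CSV.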

Could you provide a link for reference? It depends on how the graph is stored and displayed. Judging by it being dynamic like a stock ticker, there should be some text between tags that you can grab somewhere. I have seen examples of obtaining images and other content from websites using Beautiful Soup, so it's not impossible.

Yesterday I was working on formatting the data into CSV format and got some really useful responses pronto.

Check it out: How can I format every other line to be merged with the line before it? (In Python)

Also, something I learnt here: if you need to harvest that data often, a good way to run scripts automatically is CRON jobs.
