How to filter this data in pandas data frame or numpy array?

I'm trying to plot performance metrics of various assets in a backtest.

I have imported 'test_predictions.json' into a pandas data frame. It is a list of dictionaries and contains results from various assets (listed one after the other). Here is a sample of the data:

trading_pair return  timestamp prediction
[u'Poloniex_ETH_BTC' 0.003013302628677 1450753200L -0.157053292753482]
[u'Poloniex_ETH_BTC' 0.006013302628677 1450753206L -0.187053292753482]
...
[u'Poloniex_FCT_BTC' 0.006013302628677 1450753100L 0.257053292753482] 

Each backtest starts and ends at different times.

Here is the data for the assets of interest:

'''
#These are the assets I would like to analyse
Poloniex_DOGE_BTC 2015-10-21 02:00:00 1445392800
Poloniex_DOGE_BTC 2016-01-12 05:00:00 1452574800

Poloniex_XRP_BTC 2015-10-28 06:00:00 1446012000
Poloniex_XRP_BTC 2016-01-12 05:00:00 1452574800

Poloniex_XMR_BTC 2015-10-21 14:00:00 1445436000
Poloniex_XMR_BTC 2016-01-12 06:00:00 1452578400

Poloniex_VRC_BTC 2015-10-25 07:00:00 1445756400
Poloniex_VRC_BTC 2016-01-12 00:00:00 1452556800

'''

So I'm trying to make a new array that contains the data for these assets. Each asset must be sliced appropriately so they all start from the latest start time and end at the earliest end time (otherwise there will be incomplete data).

#each array should start and end:  
#start 2015-10-28 06:00:00
#end 2016-01-12 00:00:00

So the question is:

How can I search for an asset, e.g. Poloniex_DOGE_BTC, and then acquire the indices for the start and end times specified above?

I will be plotting the data via numpy, so maybe it's better to turn it into a numpy array with df.values and then conduct the search? Then I could use np.hstack(df_index_asset1, def_index_asset2) so it's in the right form to plot. So the problem is: using either pandas or numpy, how do I retrieve the data for the specified assets that fall within the master start and end times?

On a side note, here is the code I wrote to get the start and end dates. It's not the most efficient, so improving that would be a bonus.

EDIT:

From Kartik's answer I tried to obtain just the data for the asset name 'Poloniex_DOGE_BTC' using the following code:

import pandas as pd
import numpy as np

preds = 'test_predictions.json'

df = pd.read_json(preds)

asset = 'Poloniex_DOGE_BTC'

grouped = df.groupby(asset)

print(grouped)

But it throws this error:

EDIT2: I have changed the link to the data so it is test_predictions.json

EDIT3: this worked a treat:

preds = 'test_predictions.json'

df = pd.read_json(preds)

asset = 'Poloniex_DOGE_BTC'

grouped = df.groupby('market_trading_pair')
print(grouped.get_group(asset))

#each array should start and end: 
#start 2015-10-28 06:00:00 1446012000
#end 2016-01-12 00:00:00 1452556800 

Now how can we truncate the data so that it starts and ends at the above timestamps?

Firstly, why like this?

data = pd.read_json(preds).values
df = pd.DataFrame(data)

You can just write that as:

df = pd.read_json(preds)

And if you want a NumPy array from df, then you can execute data = df.values later.

And it should put the data in a DataFrame. (Unless I am much mistaken, because I have never used read_json() before.)

The second thing is getting the data for each asset out. For that, I am assuming you need to process all assets. To do that, you can simply do:

# To convert it to datetime.
# This is not important, and you can skip it if you want, because epoch times in
# seconds will perfectly work with the rest of the method.
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')

# This will give you a group for each asset on which you can apply some function.
# We will apply min and max to get the desired output.
grouped = df.groupby('trading_pair') # Where 'trading_pair' is the name of the column that has the asset names
start_times = grouped['timestamp'].min()  # min is a method, so it must be called
end_times = grouped['timestamp'].max()

Now start_times and end_times will be Series. The index of each Series will be your asset names, and the values will be the minimum and maximum times respectively.
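The "latest start, earliest end" window from the question can be read off these two Series directly. Here is a minimal sketch, assuming a small made-up frame (the asset names A_BTC and B_BTC and all values are hypothetical) with the same 'trading_pair' and 'timestamp' columns:

```python
import pandas as pd

# Hypothetical sample data; only the column names follow the answer above.
df = pd.DataFrame({
    "trading_pair": ["A_BTC", "A_BTC", "A_BTC", "B_BTC", "B_BTC", "B_BTC"],
    "timestamp": [100, 200, 300, 150, 250, 350],
    "return": [0.01, 0.02, 0.03, 0.04, 0.05, 0.06],
})

grouped = df.groupby("trading_pair")
start_times = grouped["timestamp"].min()  # first timestamp of each asset
end_times = grouped["timestamp"].max()    # last timestamp of each asset

# The shared window runs from the latest start to the earliest end.
common_start = start_times.max()
common_end = end_times.min()
print(common_start, common_end)
```

This should also cover the "bonus" part of the question, since no explicit loop over the assets is needed.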

I think this is the answer you are looking for, from my understanding of your question. Please let me know if that is not the case.

EDIT

If you are specifically looking for a few (one or two or ten) assets, you can modify the above code like so:

asset = ['list', 'of', 'required', 'assets'] # Even one element is fine.
req_df = df[df['trading_pair'].isin(asset)]

grouped = req_df.groupby('trading_pair') # Where 'trading_pair' is the name of the column that has the asset
start_times = grouped['timestamp'].min()
end_times = grouped['timestamp'].max()
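To then cut every selected asset down to the common window and get a NumPy array for plotting, something along these lines might work. It is only a sketch on hypothetical data: the asset names and values are invented, and it assumes each asset has the same timestamps inside the window, otherwise the columns would have different lengths and could not be stacked. It uses np.column_stack instead of the np.hstack mentioned in the question so that each asset ends up in its own column.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with the column names used above.
df = pd.DataFrame({
    "trading_pair": ["A_BTC"] * 4 + ["B_BTC"] * 4,
    "timestamp": [100, 200, 300, 400, 200, 300, 400, 500],
    "return": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
})

assets = ["A_BTC", "B_BTC"]
req_df = df[df["trading_pair"].isin(assets)]

# Per-asset first and last timestamps in one pass.
bounds = req_df.groupby("trading_pair")["timestamp"].agg(["min", "max"])
common_start = bounds["min"].max()  # latest start
common_end = bounds["max"].min()    # earliest end

# Keep only rows inside the shared window (both ends inclusive).
in_window = req_df["timestamp"].between(common_start, common_end)
window_df = req_df[in_window]

# One column of returns per asset; assumes equal row counts per asset.
columns = [
    window_df.loc[window_df["trading_pair"] == name, "return"].to_numpy()
    for name in assets
]
returns = np.column_stack(columns)
print(returns.shape)
```

The same boolean mask works whether 'timestamp' holds epoch seconds or datetimes, as long as the bounds are of the same type.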

As an aside, plotting datetimes from Pandas is very convenient as well. I use it all the time to produce most of the plots I create. And all of my data is timestamped.
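As a rough illustration of that, one option is to reshape the frame so that the timestamps form the index and each asset is a column. The data below is hypothetical (invented asset names and returns); only the epoch value 1446012000 is taken from the question.

```python
import pandas as pd

# Hypothetical sample data with the column names used above.
df = pd.DataFrame({
    "trading_pair": ["A_BTC", "A_BTC", "B_BTC", "B_BTC"],
    "timestamp": [1446012000, 1446015600, 1446012000, 1446015600],
    "return": [0.01, 0.02, 0.03, 0.04],
})

# Epoch seconds to datetimes, as in the answer above.
df["timestamp"] = pd.to_datetime(df["timestamp"], unit="s")

# Wide layout: one row per timestamp, one column per asset.
wide = df.pivot(index="timestamp", columns="trading_pair", values="return")
print(wide)
```

With matplotlib installed, calling wide.plot() on such a frame should draw one line per asset against a datetime axis.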
