What is the most efficient way to synchronize two large data frames in pandas?

Question

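I want to synchronize two very long data frames, and performance is critical in this use case. Both data frames are indexed in chronological order by datetime or Timestamp, and the solution should exploit that index to be as fast as possible.

One way to synchronize is shown in this example: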

import pandas as pd
df1 = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6], 'B': [1, 5, 3, 4, 5, 7]}, index=pd.date_range('20140101 101501', freq='us', periods=6))
df2 = pd.DataFrame({'D': [10, 2, 30, 4, 5, 10], 'F': [1, 5, 3, 4, 5, 70]}, index=pd.date_range('20140101 101501.000003', freq='us', periods=6))

# synchronize the data frames: outer join on the index, then forward fill the gaps
df3 = df1.merge(df2, how='outer', right_index=True, left_index=True).ffill()

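My question is whether this is the most efficient way to do it. I am prepared to explore other solutions (for example using numpy or cython) if there are faster ways to solve this task.

Thanks

Note: in general the timestamps are not equally spaced (as they are in the example above), and the method should also work in that case.

Comment after reading the answers

I think there are many use cases where neither align nor merge or join helps. The point is not to use database-style semantics for aligning (which I think are not very relevant for time series). For me, aligning means mapping series A onto B with a way to handle missing values (typically sample and hold). align and join cause unwanted effects, such as timestamps being repeated several times as a result of the join. I still do not have a perfect solution, but np.searchsorted seems to help (it is much faster than using multiple join/align calls to do what I need). So far I could not find a way to do this in pandas.

How can I map A onto B so that the result has all timestamps of A and B but no repetitions (apart from those already present in A and B)?

Another typical use case is sample-and-hold synchronization, which can be solved efficiently as follows (synchronize A with B, i.e. for each timestamp in A take the corresponding value in B):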

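import numpy as np
# position of the last row of B at or before each timestamp of A
# (-1 means the timestamp precedes all of B and needs separate handling)
idx = np.searchsorted(B.index.values, A.index.values, side='right') - 1
df = A.copy()
for col in B:
    df[col] = B[col].iloc[idx].values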

The resulting df has the same index as A and the synchronized values from B.

Is there an efficient way to do these things directly in pandas?

Answer 1

If you need to synchronize, use align (see the pandas documentation). Otherwise, merge is a good option.

In [18]: N=100000

In [19]: df1=pd.DataFrame({'A':[1,2,3,4,5,6]*N, 'B':[1,5,3,4,5,7]*N}, index=pd.date_range('20140101 101501', freq='u', periods=6*N))

In [20]: df2=pd.DataFrame({'D':[10,2,30,4,5,10]*N, 'F':[1,5,3,4,5,70]*N}, index=pd.date_range('20140101 101501.000003', freq='u', periods=6*N))

In [21]: %timeit df1.merge(df2, how='outer', right_index=True, left_index=True).fillna(method='ffill')
10 loops, best of 3: 69.3 ms per loop

In [22]: %timeit df1.align(df2)
10 loops, best of 3: 36.5 ms per loop

In [24]: pd.set_option('max_rows',10)

In [25]: x, y = df1.align(df2)

In [26]: x
Out[26]: 
                             A   B   D   F
2014-01-01 10:15:01          1   1 NaN NaN
2014-01-01 10:15:01.000001   2   5 NaN NaN
2014-01-01 10:15:01.000002   3   3 NaN NaN
2014-01-01 10:15:01.000003   4   4 NaN NaN
2014-01-01 10:15:01.000004   5   5 NaN NaN
...                         ..  ..  ..  ..
2014-01-01 10:15:01.599998   5   5 NaN NaN
2014-01-01 10:15:01.599999   6   7 NaN NaN
2014-01-01 10:15:01.600000 NaN NaN NaN NaN
2014-01-01 10:15:01.600001 NaN NaN NaN NaN
2014-01-01 10:15:01.600002 NaN NaN NaN NaN

[600003 rows x 4 columns]

In [27]: y
Out[27]: 
                             A   B   D   F
2014-01-01 10:15:01        NaN NaN NaN NaN
2014-01-01 10:15:01.000001 NaN NaN NaN NaN
2014-01-01 10:15:01.000002 NaN NaN NaN NaN
2014-01-01 10:15:01.000003 NaN NaN  10   1
2014-01-01 10:15:01.000004 NaN NaN   2   5
...                         ..  ..  ..  ..
2014-01-01 10:15:01.599998 NaN NaN   2   5
2014-01-01 10:15:01.599999 NaN NaN  30   3
2014-01-01 10:15:01.600000 NaN NaN   4   4
2014-01-01 10:15:01.600001 NaN NaN   5   5
2014-01-01 10:15:01.600002 NaN NaN  10  70

[600003 rows x 4 columns]
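
The aligned frames above still hold NaN wherever one side has no row. As a minimal sketch of how a sample-and-hold result like the one in the question could be produced from there, the following uses two small hypothetical frames `a` and `b` with irregular timestamps (not the benchmark data above). It assumes forward filling is the desired way to handle the gaps.

```python
import pandas as pd

# Hypothetical small frames with irregular, partly overlapping timestamps
a = pd.DataFrame(
    {"A": [1, 2, 3]},
    index=pd.to_datetime(
        ["2014-01-01 10:00:00", "2014-01-01 10:00:02", "2014-01-01 10:00:05"]
    ),
)
b = pd.DataFrame(
    {"D": [10, 20, 30]},
    index=pd.to_datetime(
        ["2014-01-01 10:00:01", "2014-01-01 10:00:02", "2014-01-01 10:00:04"]
    ),
)

# align(axis=0) puts both frames on the union of the two indexes
# and leaves each frame's own columns untouched
x, y = a.align(b, join="outer", axis=0)

# Forward fill each side (sample and hold), then place the columns side by side
both = pd.concat([x.ffill(), y.ffill()], axis=1)

# Sample and hold of b on a's timestamps only: for each timestamp in a,
# take the last row of b at or before it
b_on_a = b.reindex(a.index, method="ffill")
```

Rows that come before the first observation of a frame stay NaN in both results, since there is nothing to hold yet.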

Answer 2

If you want to use the index of one of the DataFrames as the pattern for synchronization, this may be useful:

df3 = df1.loc[df1.index.isin(df2.index)]

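Note: I am assuming that df1 has more rows than df2.

The previous snippet gives you the elements of df1 whose index also appears in df2. If you want to add the new index values instead, you may prefer to do this: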

new_indexes = df1.index.difference(df2.index)  # index values in df1 but not in df2
default_values = np.zeros((new_indexes.shape[0], df2.shape[1]))
df2 = pd.concat([df2, pd.DataFrame(default_values, index=new_indexes, columns=df2.columns)]).sort_index()
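
As a hedged alternative sketch of the same idea, the missing index values can be added in one step with `reindex` on the union of both indexes, where `fill_value=0` plays the role of the zero-filled default rows. The two small frames here are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical frames: df2 only covers some of df1's timestamps
df1 = pd.DataFrame({"A": [1, 2, 3, 4]}, index=pd.date_range("2014-01-01", periods=4, freq="D"))
df2 = pd.DataFrame({"D": [10, 30]}, index=df1.index[[0, 2]])

# Index.union gives the sorted union of both indexes without duplicates;
# rows that df2 did not have are created and filled with 0
df2_full = df2.reindex(df2.index.union(df1.index), fill_value=0)
```

Whether 0 is a sensible default depends on the data; passing `method="ffill"` instead of `fill_value` would carry the last known value forward.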

Another way of synchronizing is described in this post.

Answer 3

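I think time series synchronization is a very simple procedure. Assume the inputs ts# (# = 0, 1, 2) are filled as follows:

ts#[0,:] - time

ts#[1,:] - ask

ts#[2,:] - bid

ts#[3,:] - asksz

ts#[4,:] - bidsz

The output is

totts[0,:] - synchronized time

totts[1-4,:] - ask / bid / asksz / bidsz of ts0

totts[5-8,:] - ask / bid / asksz / bidsz of ts1

totts[9-12,:] - ask / bid / asksz / bidsz of ts2

Function: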

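import numpy as np

def syncTS(ts0, ts1, ts2):
    # Row 0 of each input holds its timestamps, sorted in ascending order
    ti0 = ts0[0, :]
    ti1 = ts1[0, :]
    ti2 = ts2[0, :]

    # Merged, sorted time axis without duplicates
    totti = np.union1d(ti0, ti1)
    totti = np.union1d(totti, ti2)

    totts = np.empty((13, len(totti)))

    it0 = it1 = it2 = 0
    nT0 = len(ti0) - 1
    nT1 = len(ti1) - 1
    nT2 = len(ti2) - 1

    for it, tim in enumerate(totti):
        # Advance each cursor to the last sample at or before tim;
        # before a series' first sample, its first value is used
        while it0 < nT0 and ti0[it0 + 1] <= tim:
            it0 += 1

        while it1 < nT1 and ti1[it1 + 1] <= tim:
            it1 += 1

        while it2 < nT2 and ti2[it2 + 1] <= tim:
            it2 += 1

        totts[0, it] = tim
        for k in range(1, 5):
            totts[k, it] = ts0[k, it0]
            totts[k + 4, it] = ts1[k, it1]
            totts[k + 8, it] = ts2[k, it2]

    return totts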
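
The loop above walks the merged time axis one element at a time. A hedged, vectorized sketch of the same idea could use `np.searchsorted`, assuming sample-and-hold semantics: each output row takes the latest observation at or before the merged timestamp, and the first observation where none exists yet. The name `sync_hold` and the toy arrays are hypothetical, and the function accepts any number of series rather than exactly three.

```python
import numpy as np

def sync_hold(*series):
    """Each series is a 2-D array: row 0 holds sorted times, rows 1.. hold values."""
    # Merged, sorted time axis without duplicates across all inputs
    times = np.unique(series[0][0])
    for ts in series[1:]:
        times = np.union1d(times, ts[0])

    rows = [times]
    for ts in series:
        # Index of the last observation at or before each merged time
        idx = np.searchsorted(ts[0], times, side="right") - 1
        # idx is -1 before the first observation; clip so the first value is reused
        idx = np.clip(idx, 0, None)
        rows.append(ts[1:, idx])
    return np.vstack(rows)

# Toy input: one value row per series to keep the example short
ts0 = np.array([[1.0, 3.0], [10.0, 30.0]])
ts1 = np.array([[2.0, 3.0, 4.0], [200.0, 300.0, 400.0]])
out = sync_hold(ts0, ts1)
```

Here `out[0]` is the merged time axis and the following rows hold the values of each input carried forward onto it.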
