
Merge multiple dataframes with non-unique indices

I have a bunch of pandas time series. Here is an example for illustration (real data has ~1 million entries in each series):

>>> for s in series:
...     print(s.head())
...     print()
2014-01-01 01:00:00   -0.546404
2014-01-01 01:00:00   -0.791217
2014-01-01 01:00:01    0.117944
2014-01-01 01:00:01   -1.033161
2014-01-01 01:00:02    0.013415
2014-01-01 01:00:02    0.368853
2014-01-01 01:00:02    0.380515
2014-01-01 01:00:02    0.976505
2014-01-01 01:00:02    0.881654
dtype: float64

2014-01-01 01:00:00   -0.111314
2014-01-01 01:00:01    0.792093
2014-01-01 01:00:01   -1.367650
2014-01-01 01:00:02   -0.469194
2014-01-01 01:00:02    0.569606
2014-01-01 01:00:02   -1.777805
dtype: float64

2014-01-01 01:00:00   -0.108123
2014-01-01 01:00:00   -1.518526
2014-01-01 01:00:00   -1.395465
2014-01-01 01:00:01    0.045677
2014-01-01 01:00:01    1.614789
2014-01-01 01:00:01    1.141460
2014-01-01 01:00:02    1.365290
dtype: float64

The times in each series are not unique. For example, the last series has 3 values at 2014-01-01 01:00:00. The second series has only one value at that time. Also, not all the times need to be present in all the series.

My goal is to create a merged DataFrame whose index is the union of all the times in the individual time series. Each timestamp should be repeated as many times as needed. So, if a timestamp occurs (2, 0, 3, 4) times in the series above, the timestamp should be repeated 4 times (the maximum of the counts) in the resulting DataFrame. The values of each column should be "filled forward".

As an example, the result of merging the above should be:

                             c0                c1              c2
2014-01-01 01:00:00   -0.546404         -0.111314       -0.108123
2014-01-01 01:00:00   -0.791217         -0.111314       -1.518526
2014-01-01 01:00:00   -0.791217         -0.111314       -1.395465
2014-01-01 01:00:01    0.117944          0.792093        0.045677
2014-01-01 01:00:01   -1.033161         -1.367650        1.614789
2014-01-01 01:00:01   -1.033161         -1.367650        1.141460
2014-01-01 01:00:02    0.013415         -0.469194        1.365290
2014-01-01 01:00:02    0.368853          0.569606        1.365290
2014-01-01 01:00:02    0.380515         -1.777805        1.365290
2014-01-01 01:00:02    0.976505         -1.777805        1.365290
2014-01-01 01:00:02    0.881654         -1.777805        1.365290

To give an idea of size and "uniqueness" in my real data:

>>> [len(s.index.unique()) for s in series]
[48617, 48635, 48720, 48620]
>>> len(times)
51043
>>> [len(s) for s in series]
[1143409, 1143758, 1233646, 1242864]

Here is what I have tried:

I can create a union of all the unique times:

uniques = [s.index.unique() for s in series]
times = uniques[0].union_many(uniques[1:])
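As an aside for readers on newer pandas: Index.union_many was removed in pandas 1.0, so the union has to be built differently. A minimal sketch using functools.reduce over Index.union, with small made-up series standing in for the real data:

```python
from functools import reduce

import pandas as pd

# Toy series with duplicate, overlapping timestamps (stand-ins for the real data).
idx = pd.to_datetime(['2014-01-01 01:00:00', '2014-01-01 01:00:00',
                      '2014-01-01 01:00:01'])
series = [pd.Series([1.0, 2.0, 3.0], index=idx),
          pd.Series([4.0], index=pd.to_datetime(['2014-01-01 01:00:02']))]

uniques = [s.index.unique() for s in series]
# Fold the pairwise unions together; equivalent to the old union_many.
times = reduce(lambda a, b: a.union(b), uniques)
print(times)
```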

I can now index each series using times:

series[0].loc[times]

But that seems to repeat the values for each item in times, which is not what I want.

I can't reindex() the series using times because the index for each series is not unique.

I can do it by a slow Python loop or do it in Cython, but is there a "pandas-only" way to do what I want to do?

I created my example series using the following code:

import random
import numpy
import pandas

def make_series(n=3, rep=(0, 5)):
    # n timestamps, each repeated a random number of times in [rep[0], rep[1]].
    times = pandas.date_range('2014/01/01 01:00:00', periods=n, freq='S')
    reps = [random.randint(*rep) for _ in range(n)]
    dates = []
    values = numpy.random.randn(numpy.sum(reps))
    for date, rep in zip(times, reps):
        dates.extend([date] * rep)
    return pandas.Series(data=values, index=dates)

series = [make_series() for _ in range(3)]

This is very nearly a concat:

In [11]: s0 = pd.Series([1, 2, 3], name='s0')

In [12]: s1 = pd.Series([1, 4, 5], name='s1')

In [13]: pd.concat([s0, s1], axis=1)
Out[13]:
   s0  s1
0   1   1
1   2   4
2   3   5

However, concat cannot deal with duplicate indices (it's ambiguous how they should merge, and in your case you don't want to merge them in the "ordinary" way - as combinations)...

I think you are going to need a groupby:

In [21]: s0 = pd.Series([1, 2, 3], [0, 0, 1], name='s0')

In [22]: s1 = pd.Series([1, 4, 5], [0, 1, 1], name='s1')

Note: I've appended a faster method which works for int-like dtypes (like datetime64).

We want to add a MultiIndex level of the cumcounts for each item; that way we trick the Index into becoming unique:

In [23]: s0.groupby(level=0).cumcount()
Out[23]:
0    0
0    1
1    0
dtype: int64

Note: I can't seem to append a column to the index without it being a DataFrame..

In [24]: df0 = pd.DataFrame(s0).set_index(s0.groupby(level=0).cumcount(), append=True)

In [25]: df1 = pd.DataFrame(s1).set_index(s1.groupby(level=0).cumcount(), append=True)

In [26]: df0
Out[26]:
     s0
0 0   1
  1   2
1 0   3

Now we can go ahead and concat these:

In [27]: res = pd.concat([df0, df1], axis=1)

In [28]: res
Out[28]:
     s0  s1
0 0   1   1
  1   2 NaN
1 0   3   4
  1 NaN   5

If you want to drop the cumcount level:

In [29]: res.index = res.index.droplevel(1)

In [30]: res
Out[30]:
   s0  s1
0   1   1
0   2 NaN
1   3   4
1 NaN   5

Now you can ffill to get the desired result... (if you were concerned about forward filling across different datetimes you could groupby the index and ffill).
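Putting those steps together on the small integer-indexed series from above; a minimal sketch (using a plain global ffill, i.e. without the per-index groupby caveat):

```python
import pandas as pd

s0 = pd.Series([1, 2, 3], index=[0, 0, 1], name='s0')
s1 = pd.Series([1, 4, 5], index=[0, 1, 1], name='s1')

def dedup(s):
    # Append the per-label cumcount as a second index level, so that
    # (label, count) pairs are unique and concat can align the series.
    return pd.DataFrame(s).set_index(s.groupby(level=0).cumcount(), append=True)

res = pd.concat([dedup(s0), dedup(s1)], axis=1)
res.index = res.index.droplevel(1)
res = res.ffill()
print(res)
# s0 column: 1, 2, 3, 3   /   s1 column: 1, 1, 4, 5
```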


If the upper bound on repetitions in each group is reasonable (I'm picking 1000, but much higher is still "reasonable"!), you could use a Float64Index as follows (and certainly it seems more elegant):

s0.index = s0.index + (s0.groupby(level=0)._cumcount_array() / 1000.)
s1.index = s1.index + (s1.groupby(level=0)._cumcount_array() / 1000.)
res = pd.concat([s0, s1], axis=1)
res.index = res.index.values.astype('int64')
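On pandas 0.14 and later the same trick can be written with the public cumcount rather than the private helper; a sketch on the small integer-indexed example (assuming fewer than 1000 repeats per label):

```python
import pandas as pd

s0 = pd.Series([1, 2, 3], index=[0, 0, 1], name='s0')
s1 = pd.Series([1, 4, 5], index=[0, 1, 1], name='s1')

# Nudge duplicate labels apart by cumcount/1000 so each label becomes unique.
s0.index = s0.index + s0.groupby(level=0).cumcount().values / 1000.0
s1.index = s1.index + s1.groupby(level=0).cumcount().values / 1000.0

res = pd.concat([s0, s1], axis=1)
# Truncate the fractional offsets to recover the original labels.
res.index = res.index.values.astype('int64')
print(res)
```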

Note: I'm cheekily using a private method here which returns the cumcount as a numpy array...
Note 2: This is pandas 0.14; in 0.13 you have to pass a numpy array to _cumcount_array (e.g. np.arange(len(s0))), and pre-0.13 you're out of luck - there's no cumcount.

How about this - convert to DataFrames with labeled columns first, then do an outer merge().

s1 = pd.Series(index=['4/4/14','4/4/14','4/5/14'],
                      data=[12.2,0.0,12.2])
s2 = pd.Series(index=['4/5/14','4/8/14'],
                      data=[14.2,3.0])
d1 = s1.to_frame(name='a')
d2 = s2.to_frame(name='b')

final_df = pd.merge(d1, d2, left_index=True, right_index=True, how='outer')

This gives me:

           a     b
4/4/14  12.2   NaN
4/4/14   0.0   NaN
4/5/14  12.2   14.2
4/8/14   NaN   3.0
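If the goal is the forward-filled frame from the question, one could tack an ffill() onto this outer merge; a sketch with the same toy data:

```python
import pandas as pd

s1 = pd.Series(index=['4/4/14', '4/4/14', '4/5/14'], data=[12.2, 0.0, 12.2])
s2 = pd.Series(index=['4/5/14', '4/8/14'], data=[14.2, 3.0])

d1 = s1.to_frame(name='a')
d2 = s2.to_frame(name='b')

final_df = pd.merge(d1, d2, left_index=True, right_index=True, how='outer')
# Leading NaNs in 'b' stay NaN: there is nothing earlier to fill from.
filled = final_df.ffill()
print(filled)
```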
