
Merge multiple dataframes with non-unique indices

I have a bunch of pandas time series. Here is an example for illustration (the real data has ~1 million entries in each series):

>>> for s in series:
...     print(s)
...     print()
2014-01-01 01:00:00   -0.546404
2014-01-01 01:00:00   -0.791217
2014-01-01 01:00:01    0.117944
2014-01-01 01:00:01   -1.033161
2014-01-01 01:00:02    0.013415
2014-01-01 01:00:02    0.368853
2014-01-01 01:00:02    0.380515
2014-01-01 01:00:02    0.976505
2014-01-01 01:00:02    0.881654
dtype: float64

2014-01-01 01:00:00   -0.111314
2014-01-01 01:00:01    0.792093
2014-01-01 01:00:01   -1.367650
2014-01-01 01:00:02   -0.469194
2014-01-01 01:00:02    0.569606
2014-01-01 01:00:02   -1.777805
dtype: float64

2014-01-01 01:00:00   -0.108123
2014-01-01 01:00:00   -1.518526
2014-01-01 01:00:00   -1.395465
2014-01-01 01:00:01    0.045677
2014-01-01 01:00:01    1.614789
2014-01-01 01:00:01    1.141460
2014-01-01 01:00:02    1.365290
dtype: float64

The times in each series are not unique. For example, the last series has 3 values at 2014-01-01 01:00:00. The second series has only one value at that time. Also, not all the times need to be present in all the series.

My goal is to create a merged DataFrame whose index is the union of all the times in the individual time series. Each timestamp should be repeated as many times as needed: if a timestamp occurs, say, (2, 0, 3, 4) times across four series, it should appear 4 times (the maximum of the counts) in the resulting DataFrame. The values of each column should be "filled forward".

As an example, the result of merging the above should be:

                             c0                c1              c2
2014-01-01 01:00:00   -0.546404         -0.111314       -0.108123
2014-01-01 01:00:00   -0.791217         -0.111314       -1.518526
2014-01-01 01:00:00   -0.791217         -0.111314       -1.395465
2014-01-01 01:00:01    0.117944          0.792093        0.045677
2014-01-01 01:00:01   -1.033161         -1.367650        1.614789
2014-01-01 01:00:01   -1.033161         -1.367650        1.141460
2014-01-01 01:00:02    0.013415         -0.469194        1.365290
2014-01-01 01:00:02    0.368853          0.569606        1.365290
2014-01-01 01:00:02    0.380515         -1.777805        1.365290
2014-01-01 01:00:02    0.976505         -1.777805        1.365290
2014-01-01 01:00:02    0.881654         -1.777805        1.365290

To give an idea of size and "uniqueness" in my real data:

>>> [len(s.index.unique()) for s in series]
[48617, 48635, 48720, 48620]
>>> len(times)
51043
>>> [len(s) for s in series]
[1143409, 1143758, 1233646, 1242864]

Here is what I have tried:

I can create a union of all the unique times:

import functools

uniques = [s.index.unique() for s in series]
# Fold pairwise unions (equivalent to the old Index.union_many(), which has
# since been removed from pandas).
times = functools.reduce(lambda a, b: a.union(b), uniques)

I can now index each series using times:

series[0].loc[times]

But that seems to repeat the values for each item in times, which is not what I want.

I can't reindex() the series using times because the index for each series is not unique.
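A toy example showing both behaviors (the values here are arbitrary):

import pandas as pd

s = pd.Series([1.0, 2.0, 3.0],
              index=pd.to_datetime(['2014-01-01 01:00:00',
                                    '2014-01-01 01:00:00',
                                    '2014-01-01 01:00:01']))
times = s.index.unique()

# .loc pulls in every duplicate row for each requested label, so indexing
# with the unique times just reproduces all of the rows:
print(s.loc[times])    # 3 rows, same as s itself

# reindex() refuses a duplicated source index outright:
try:
    s.reindex(times)
except ValueError as e:
    print(e)           # "cannot reindex ..." (message varies by version)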

I can do it by a slow Python loop or do it in Cython, but is there a "pandas-only" way to do what I want to do?

I created my example series using the following code:

import random

import numpy
import pandas

def make_series(n=3, rep=(0, 5)):
    # n consecutive seconds, each repeated a random number of times in [0, 5].
    times = pandas.date_range('2014/01/01 01:00:00', periods=n, freq='S')
    reps = [random.randint(*rep) for _ in range(n)]
    dates = []
    values = numpy.random.randn(numpy.sum(reps))
    for date, rep in zip(times, reps):
        dates.extend([date] * rep)
    return pandas.Series(data=values, index=dates)

series = [make_series() for _ in range(3)]

This is very nearly a concat:

In [11]: s0 = pd.Series([1, 2, 3], name='s0')

In [12]: s1 = pd.Series([1, 4, 5], name='s1')

In [13]: pd.concat([s0, s1], axis=1)
Out[13]:
   s0  s1
0   1   1
1   2   4
2   3   5

However, concat cannot deal with duplicate indices (it's ambiguous how they should merge, and in your case you don't want to merge them in the "ordinary" way, as combinations)...

I think you are going to have to use a groupby:

In [21]: s0 = pd.Series([1, 2, 3], [0, 0, 1], name='s0')

In [22]: s1 = pd.Series([1, 4, 5], [0, 1, 1], name='s1')
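To see the ambiguity concretely, concatenating these two series as-is fails, because aligning two different indexes with duplicate labels would require reindexing (the exact exception varies across pandas versions):

pd.concat([s0, s1], axis=1)
# raises, e.g.:
# InvalidIndexError: Reindexing only valid with uniquely valued Index objects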

Note: I've appended a faster method below which works for int-like dtypes (like datetime64).

We want to add a MultiIndex level of the cumcounts for each item; that way we trick the index into becoming unique:

In [23]: s0.groupby(level=0).cumcount()
Out[23]:
0    0
0    1
1    0
dtype: int64

Note: I can't seem to append a level to the index without converting to a DataFrame first...

In [24]: df0 = pd.DataFrame(s0).set_index(s0.groupby(level=0).cumcount(), append=True)

In [25]: df1 = pd.DataFrame(s1).set_index(s1.groupby(level=0).cumcount(), append=True)

In [26]: df0
Out[26]:
     s0
0 0   1
  1   2
1 0   3

Now we can go ahead and concat these:

In [27]: res = pd.concat([df0, df1], axis=1)

In [28]: res
Out[28]:
     s0  s1
0 0   1   1
  1   2 NaN
1 0   3   4
  1 NaN   5

If you want to drop the cumcount level:

In [29]: res.index = res.index.droplevel(1)

In [30]: res
Out[30]:
   s0  s1
0   1   1
0   2 NaN
1   3   4
1 NaN   5

Now you can ffill to get the desired result... (if you were concerned about forward filling across different datetimes, you could groupby the index and ffill).
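Putting the pieces together for the question's list of datetime series, a minimal sketch (series is the list from the question; the c0/c1/c2 column names are chosen to match the desired output):

import pandas as pd

# Make each index unique by appending the per-timestamp cumcount as a second
# MultiIndex level, align everything with concat, then forward-fill the gaps.
dfs = [
    pd.DataFrame(s).set_index(s.groupby(level=0).cumcount(), append=True)
    for s in series
]
res = pd.concat(dfs, axis=1)
res.columns = ['c%d' % i for i in range(len(dfs))]
res.index = res.index.droplevel(1)    # drop the helper cumcount level

res = res.ffill()                     # plain forward fill, as above
# res = res.groupby(level=0).ffill()  # alternative: never fill across timestamps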


If the upper bound on repetitions in each group were reasonable (I'm picking 1000, but much higher is still "reasonable"!), you could use a Float64Index as follows (and certainly it seems more elegant):

# Offset each duplicate by a distinct small fraction so the index becomes unique.
s0.index = s0.index + (s0.groupby(level=0)._cumcount_array() / 1000.)
s1.index = s1.index + (s1.groupby(level=0)._cumcount_array() / 1000.)
res = pd.concat([s0, s1], axis=1)
# Truncate the fractional offsets to recover the original labels.
res.index = res.index.values.astype('int64')

Note: I'm cheekily using a private method here which returns the cumcount as a numpy array...
Note2: This is pandas 0.14; in 0.13 you have to pass a numpy array to _cumcount_array (e.g. np.arange(len(s0))); pre-0.13 you're out of luck, since there's no cumcount.
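In current pandas releases the private _cumcount_array no longer exists, but the public cumcount gives the same per-label counter, so the trick can be sketched like this (for a DatetimeIndex you would add small Timedeltas instead of fractions):

import pandas as pd

s0 = pd.Series([1, 2, 3], [0, 0, 1], name='s0')
s1 = pd.Series([1, 4, 5], [0, 1, 1], name='s1')

# cumcount() numbers the duplicates 0, 1, 2, ... within each index label.
s0.index = s0.index + s0.groupby(level=0).cumcount().to_numpy() / 1000.0
s1.index = s1.index + s1.groupby(level=0).cumcount().to_numpy() / 1000.0

res = pd.concat([s0, s1], axis=1)
res.index = res.index.values.astype('int64')  # strip the fractional offsets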

How about this: convert to DataFrames with labeled columns first, then do an outer merge on the index.

a = pd.Series(index=['4/4/14', '4/4/14', '4/5/14'],
              data=[12.2, 0.0, 12.2])
b = pd.Series(index=['4/5/14', '4/8/14'],
              data=[14.2, 3.0])
d1 = a.to_frame('a')
d2 = b.to_frame('b')

final_df = pd.merge(d1, d2, left_index=True, right_index=True, how='outer')

This gives me

           a     b
4/4/14  12.2   NaN
4/4/14   0.0   NaN
4/5/14  12.2   14.2
4/8/14   NaN   3.0
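One caveat: when the same timestamp is duplicated in both inputs, an outer merge on the index produces every pairing of the duplicates (a cross-product), not the max-count rows the question asks for, and it does no forward filling. A minimal demonstration (left/right are hypothetical names):

import pandas as pd

left = pd.DataFrame({'a': [1.0, 2.0]}, index=['4/4/14', '4/4/14'])
right = pd.DataFrame({'b': [3.0, 4.0]}, index=['4/4/14', '4/4/14'])

# Two duplicates on each side give 2 * 2 = 4 rows, not max(2, 2) = 2.
print(pd.merge(left, right, left_index=True, right_index=True, how='outer'))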
