
pandas.DataFrame: How to align / group and sort data by index?

I'm new to pandas and don't yet have a good overview of what it can do or how to use it, so the problem is hopefully simple :)

I have a DataFrame with a date index and several columns (stocks and their Open and Close prices). Here is some example data for two stocks, A and B:

import pandas as pd
_ = pd.to_datetime
A_dt = [_('2018-01-04'), _('2018-01-01'), _('2018-01-05')]
B_dt = [_('2018-01-01'), _('2018-01-05'), _('2018-01-03'), _('2018-01-02')]
A_data = [(12, 11), (10, 9), (8, 9)]
B_data = [(2, 2), (3, 4), (4, 4), (5, 3)]

As you can see, the data is incomplete: each series is missing different dates. I want to combine these data into a single DataFrame with a sorted row index dt and 4 columns (2 stocks x 2 time series each).

When I do it this way, everything works fine (except that I'd like to change the column-levels and don't know how to do it):

# MultiIndex on axis 0, then unstacking
i0_a = pd.MultiIndex.from_tuples([("A", x) for x in A_dt], names=['symbol', 'dt'])
i0_b = pd.MultiIndex.from_tuples([("B", x) for x in B_dt], names=['symbol', 'dt'])

df0_a = pd.DataFrame(A_data, index=i0_a, columns=["Open", "Close"])
df0_b = pd.DataFrame(B_data, index=i0_b, columns=["Open", "Close"])

df = pd.concat([df0_a, df0_b])

df = df.unstack('symbol')  # this automatically sorts by dt.
print(df)

#            Open      Close
#symbol         A    B     A    B
#dt
#2018-01-01  10.0  2.0   9.0  2.0
#2018-01-02   NaN  5.0   NaN  3.0
#2018-01-03   NaN  4.0   NaN  4.0
#2018-01-04  12.0  NaN  11.0  NaN
#2018-01-05   8.0  3.0   9.0  4.0

However, when I put the MultiIndex on the columns, things are different:

# MultiIndex on axis 1
i1_a = pd.MultiIndex.from_tuples([("A", "Open"), ("A", "Close")], names=['symbol', 'series'])
i1_b = pd.MultiIndex.from_tuples([("B", "Open"), ("B", "Close")], names=['symbol', 'series'])

df1_a = pd.DataFrame(A_data, index=A_dt, columns=i1_a)
df1_b = pd.DataFrame(B_data, index=B_dt, columns=i1_b)

df = pd.concat([df1_a, df1_b])

print(df)

#symbol         A           B
#series     Close  Open Close Open
#2018-01-04  11.0  12.0   NaN  NaN
#2018-01-01   9.0  10.0   NaN  NaN
#2018-01-05   9.0   8.0   NaN  NaN
#2018-01-01   NaN   NaN   2.0  2.0
#2018-01-05   NaN   NaN   4.0  3.0
#2018-01-03   NaN   NaN   4.0  4.0
#2018-01-02   NaN   NaN   3.0  5.0

  1. Why isn't the data aligned automatically in this case, when it is in the other?
  2. How can I align and sort it in the second example?
  3. Which method would probably be faster on a large dataset (about 5000 stocks, 1000 timesteps, and about 20 series per stock rather than just Open and Close)? The result will eventually be used as input for a keras machine learning model.

Edit: With jezrael's answer, I timed 3 different methods of concatenating / combining DataFrames. My first approach is the fastest. Using combine_first turns out to be more than an order of magnitude slower than the other methods. The data in this example is still kept very small:

import timeit
setup = """
import pandas as pd
import numpy as np

stocks = 20
steps = 20
features = 10

data = []
index_method1 = []
index_method2 = []
cols_method1 = []
cols_method2 = []

df = None
for s in range(stocks):
    name = "stock{0}".format(s)
    index = np.arange(steps)
    data.append(np.random.rand(steps, features))
    index_method1.append(pd.MultiIndex.from_tuples([(name, x) for x in index], names=['symbol', 'dt']))
    index_method2.append(index)
    cols_method1.append([chr(65 + x) for x in range(features)])
    cols_method2.append(pd.MultiIndex.from_arrays([[name] * features, [chr(65 + x) for x in range(features)]], names=['symbol', 'series']))
"""

method1 = """
for s in range(stocks):
    df_new = pd.DataFrame(data[s], index=index_method1[s], columns=cols_method1[s])
    if s == 0:
        df = df_new
    else:
        df = pd.concat([df, df_new])
df = df.unstack('symbol')
"""

method2 = """
for s in range(stocks):
    df_new = pd.DataFrame(data[s], index=index_method2[s], columns=cols_method2[s])
    if s == 0:
        df = df_new
    else:
        df = df.combine_first(df_new)
"""

method3 = """
for s in range(stocks):
    df_new = pd.DataFrame(data[s], index=index_method2[s], columns=cols_method2[s])
    if s == 0:
        df = df_new.stack()
    else:
        df = pd.concat([df, df_new.stack()], axis=1)

df = df.unstack().swaplevel(0,1, axis=1).sort_index(axis=1)
"""

print ("Multi-Index axis 0, then concat: {} s".format((timeit.timeit(method1, setup, number=1))))
print ("Multi-Index axis 1, combine_first: {} s".format((timeit.timeit(method2, setup, number=1))))
print ("Stack and then concat: {} s".format((timeit.timeit(method3, setup, number=1))))

Multi-Index axis 0, then concat: 0.134283173989 s
Multi-Index axis 1, combine_first: 5.02396191049 s
Stack and then concat: 0.272278263371 s

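The problem is that the two DataFrames have different MultiIndex columns. By default, concat appends rows (axis=0) and only lines up the columns, so with no columns in common it takes the union of the columns and stacks the rows without aligning the dates.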
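
To make the difference concrete, here is a minimal sketch that rebuilds the two frames from the question and compares the default row-wise concat with a column-wise one. Note that concat with axis=1 followed by sort_index is an alternative added here for illustration, not one of the methods proposed in this answer; the names stacked and aligned are hypothetical.

```python
import pandas as pd

A_dt = pd.to_datetime(['2018-01-04', '2018-01-01', '2018-01-05'])
B_dt = pd.to_datetime(['2018-01-01', '2018-01-05', '2018-01-03', '2018-01-02'])
A_data = [(12, 11), (10, 9), (8, 9)]
B_data = [(2, 2), (3, 4), (4, 4), (5, 3)]

i1_a = pd.MultiIndex.from_tuples([("A", "Open"), ("A", "Close")], names=['symbol', 'series'])
i1_b = pd.MultiIndex.from_tuples([("B", "Open"), ("B", "Close")], names=['symbol', 'series'])
df1_a = pd.DataFrame(A_data, index=A_dt, columns=i1_a)
df1_b = pd.DataFrame(B_data, index=B_dt, columns=i1_b)

# Default axis=0: rows are appended, so the two dates shared by A and B appear twice.
stacked = pd.concat([df1_a, df1_b])
print(stacked.shape)            # (7, 4)
print(stacked.index.is_unique)  # False

# axis=1: columns are appended and rows are aligned on the union of the dates.
# sort_index() is called explicitly rather than relying on concat's own ordering.
aligned = pd.concat([df1_a, df1_b], axis=1).sort_index()
print(aligned.shape)            # (5, 4)
print(aligned)
```

Dates that only one stock has are filled with NaN for the other stock, just as in the unstack result from the question.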

One solution is to stack each DataFrame into a Series, concat them into a 2-column DataFrame, then unstack. To get the MultiIndex levels in the correct order, add swaplevel and sort_index:

df = (pd.concat([df1_a.stack(), df1_b.stack()], axis=1)
        .unstack()
        .swaplevel(0,1, axis=1)
        .sort_index(axis=1))
print (df)
series     Close       Open     
symbol         A    B     A    B
2018-01-01   9.0  2.0  10.0  2.0
2018-01-02   NaN  3.0   NaN  5.0
2018-01-03   NaN  4.0   NaN  4.0
2018-01-04  11.0  NaN  12.0  NaN
2018-01-05   9.0  4.0   8.0  3.0
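
The same swaplevel / sort_index pair should also cover the aside in the question about changing the column levels after the first approach (MultiIndex on axis 0, then unstack). The following is a sketch under that reading of the question; the level name 'series' and the variable names wide and by_symbol are assumptions chosen for illustration.

```python
import pandas as pd

A_dt = pd.to_datetime(['2018-01-04', '2018-01-01', '2018-01-05'])
B_dt = pd.to_datetime(['2018-01-01', '2018-01-05', '2018-01-03', '2018-01-02'])
A_data = [(12, 11), (10, 9), (8, 9)]
B_data = [(2, 2), (3, 4), (4, 4), (5, 3)]

i0_a = pd.MultiIndex.from_tuples([("A", x) for x in A_dt], names=['symbol', 'dt'])
i0_b = pd.MultiIndex.from_tuples([("B", x) for x in B_dt], names=['symbol', 'dt'])
df0_a = pd.DataFrame(A_data, index=i0_a, columns=["Open", "Close"])
df0_b = pd.DataFrame(B_data, index=i0_b, columns=["Open", "Close"])

# First approach from the question: columns come out as (series, symbol).
wide = pd.concat([df0_a, df0_b]).unstack('symbol')
wide.columns.names = ['series', 'symbol']  # name the previously unnamed top level

# Swap the two column levels so that symbol is on top, then sort the columns
# so that each symbol's series sit next to each other.
by_symbol = wide.swaplevel(0, 1, axis=1).sort_index(axis=1)
print(by_symbol)

# With symbol as the top level, one stock's columns can be selected directly.
print(by_symbol['A'])
```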

But a better option is to use combine_first:

df = df1_a.combine_first(df1_b)
print (df)
symbol         A           B     
series     Close  Open Close Open
2018-01-01   9.0  10.0   2.0  2.0
2018-01-02   NaN   NaN   3.0  5.0
2018-01-03   NaN   NaN   4.0  4.0
2018-01-04  11.0  12.0   NaN  NaN
2018-01-05   9.0   8.0   4.0  3.0
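
For many stocks, calling combine_first or concat once per stock inside a loop copies the accumulated frame on every iteration. A possible alternative, sketched below, is to collect the per-stock frames first and combine them in a single concat call with axis=1. This is an assumption-laden illustration rather than part of the original answer: it has not been timed against the three methods in the question, and the names frames and wide are hypothetical.

```python
import pandas as pd

A_dt = pd.to_datetime(['2018-01-04', '2018-01-01', '2018-01-05'])
B_dt = pd.to_datetime(['2018-01-01', '2018-01-05', '2018-01-03', '2018-01-02'])
A_data = [(12, 11), (10, 9), (8, 9)]
B_data = [(2, 2), (3, 4), (4, 4), (5, 3)]

# One plain frame per stock, keyed by symbol. With thousands of stocks this
# dict would be filled in a loop, without combining anything yet.
frames = {
    "A": pd.DataFrame(A_data, index=A_dt, columns=["Open", "Close"]),
    "B": pd.DataFrame(B_data, index=B_dt, columns=["Open", "Close"]),
}

# A single concat: the dict keys become the top column level, rows are aligned
# on the union of all dates, and sort_index() orders the dates explicitly.
wide = pd.concat(frames, axis=1, names=["symbol", "series"]).sort_index()
print(wide)
```

Passing the frames as a dict lets concat build the (symbol, series) column MultiIndex itself, so the per-stock frames do not need MultiIndex columns of their own.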

