pandas.DataFrame: How to align / group and sort data by index?
I'm new to pandas and still don't have a good overview of its power and how to use it. So the problem is hopefully simple :)
I have a DataFrame with a date index and several columns (stocks and their Open and Close prices). Here is some example data for two stocks A and B:
import pandas as pd
_ = pd.to_datetime
A_dt = [_('2018-01-04'), _('2018-01-01'), _('2018-01-05')]
B_dt = [_('2018-01-01'), _('2018-01-05'), _('2018-01-03'), _('2018-01-02')]
A_data = [(12, 11), (10, 9), (8, 9)]
B_data = [(2, 2), (3, 4), (4, 4), (5, 3)]
As you can see, the data is incomplete, with different dates missing for each series. I want to put these data together in a single DataFrame with a sorted row index dt and 4 columns (2 stocks x 2 time series each).
When I do it this way, everything works fine (except that I'd like to change the column levels and don't know how to do it):
# MultiIndex on axis 0, then unstacking
i0_a = pd.MultiIndex.from_tuples([("A", x) for x in A_dt], names=['symbol', 'dt'])
i0_b = pd.MultiIndex.from_tuples([("B", x) for x in B_dt], names=['symbol', 'dt'])
df0_a = pd.DataFrame(A_data, index=i0_a, columns=["Open", "Close"])
df0_b = pd.DataFrame(B_data, index=i0_b, columns=["Open", "Close"])
df = pd.concat([df0_a, df0_b])
df = df.unstack('symbol') # this automatically sorts by dt.
print(df)
#             Open       Close
# symbol         A     B     A     B
# dt
# 2018-01-01  10.0   2.0   9.0   2.0
# 2018-01-02   NaN   5.0   NaN   3.0
# 2018-01-03   NaN   4.0   NaN   4.0
# 2018-01-04  12.0   NaN  11.0   NaN
# 2018-01-05   8.0   3.0   9.0   4.0
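On the open point about changing the column levels: one possible way is to swap the two column levels and then sort the columns. The following is only a sketch built on the example data; the names wide and by_symbol and the level name 'series' are assumptions for illustration, since the unstacked frame leaves that level unnamed.

```python
import pandas as pd

# Rebuild the unstacked frame from the example data.
A_dt = pd.to_datetime(['2018-01-04', '2018-01-01', '2018-01-05'])
B_dt = pd.to_datetime(['2018-01-01', '2018-01-05', '2018-01-03', '2018-01-02'])
i0_a = pd.MultiIndex.from_tuples([("A", x) for x in A_dt], names=['symbol', 'dt'])
i0_b = pd.MultiIndex.from_tuples([("B", x) for x in B_dt], names=['symbol', 'dt'])
df0_a = pd.DataFrame([(12, 11), (10, 9), (8, 9)], index=i0_a, columns=["Open", "Close"])
df0_b = pd.DataFrame([(2, 2), (3, 4), (4, 4), (5, 3)], index=i0_b, columns=["Open", "Close"])
wide = pd.concat([df0_a, df0_b]).unstack('symbol')

# Put 'symbol' on the outer column level, then sort so each stock's
# columns sit next to each other.
by_symbol = wide.swaplevel(0, 1, axis=1).sort_index(axis=1)

# The Open/Close level had no name; 'series' is an assumed label.
by_symbol.columns = by_symbol.columns.set_names(['symbol', 'series'])
print(by_symbol)
```

After this, a single stock can be selected with by_symbol['A'].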
However, when I put the MultiIndex on the columns, things are different:
# MultiIndex on axis 1
i1_a = pd.MultiIndex.from_tuples([("A", "Open"), ("A", "Close")], names=['symbol', 'series'])
i1_b = pd.MultiIndex.from_tuples([("B", "Open"), ("B", "Close")], names=['symbol', 'series'])
df1_a = pd.DataFrame(A_data, index=A_dt, columns=i1_a)
df1_b = pd.DataFrame(B_data, index=B_dt, columns=i1_b)
df = pd.concat([df1_a, df1_b])
print(df)
# symbol          A           B
# series      Close  Open Close  Open
# 2018-01-04   11.0  12.0   NaN   NaN
# 2018-01-01    9.0  10.0   NaN   NaN
# 2018-01-05    9.0   8.0   NaN   NaN
# 2018-01-01    NaN   NaN   2.0   2.0
# 2018-01-05    NaN   NaN   4.0   3.0
# 2018-01-03    NaN   NaN   4.0   4.0
# 2018-01-02    NaN   NaN   3.0   5.0
Edit: With jezrael's answer I timed 3 different methods of concatenating / combining DataFrames. My first approach is the fastest. Using combine_first turns out to be an order of magnitude slower than the other methods. The size of the data is still kept very small in the example:
import timeit
setup = """
import pandas as pd
import numpy as np
stocks = 20
steps = 20
features = 10
data = []
index_method1 = []
index_method2 = []
cols_method1 = []
cols_method2 = []
df = None
for s in range(stocks):
    name = "stock{0}".format(s)
    index = np.arange(steps)
    data.append(np.random.rand(steps, features))
    index_method1.append(pd.MultiIndex.from_tuples([(name, x) for x in index], names=['symbol', 'dt']))
    index_method2.append(index)
    cols_method1.append([chr(65 + x) for x in range(features)])
    cols_method2.append(pd.MultiIndex.from_arrays([[name] * features, [chr(65 + x) for x in range(features)]], names=['symbol', 'series']))
"""
method1 = """
for s in range(stocks):
    df_new = pd.DataFrame(data[s], index=index_method1[s], columns=cols_method1[s])
    if s == 0:
        df = df_new
    else:
        df = pd.concat([df, df_new])
df = df.unstack('symbol')
"""
method2 = """
for s in range(stocks):
    df_new = pd.DataFrame(data[s], index=index_method2[s], columns=cols_method2[s])
    if s == 0:
        df = df_new
    else:
        df = df.combine_first(df_new)
"""
method3 = """
for s in range(stocks):
    df_new = pd.DataFrame(data[s], index=index_method2[s], columns=cols_method2[s])
    if s == 0:
        df = df_new.stack()
    else:
        df = pd.concat([df, df_new.stack()], axis=1)
df = df.unstack().swaplevel(0,1, axis=1).sort_index(axis=1)
"""
print ("Multi-Index axis 0, then concat: {} s".format((timeit.timeit(method1, setup, number=1))))
print ("Multi-Index axis 1, combine_first: {} s".format((timeit.timeit(method2, setup, number=1))))
print ("Stack and then concat: {} s".format((timeit.timeit(method3, setup, number=1))))
Multi-Index axis 0, then concat: 0.134283173989 s
Multi-Index axis 1, combine_first: 5.02396191049 s
Stack and then concat: 0.272278263371 s
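All three timed methods grow the DataFrame inside the loop. A variant of the first method that was not part of the timings above is to collect the per-stock frames in a list and call concat once at the end, which may avoid some repeated copying. This is a sketch with small, arbitrary sizes; the list name frames is an assumption for illustration.

```python
import numpy as np
import pandas as pd

stocks, steps, features = 3, 4, 2  # small illustrative sizes
cols = [chr(65 + x) for x in range(features)]

frames = []
for s in range(stocks):
    name = "stock{0}".format(s)
    index = pd.MultiIndex.from_tuples(
        [(name, x) for x in np.arange(steps)], names=['symbol', 'dt'])
    frames.append(
        pd.DataFrame(np.random.rand(steps, features), index=index, columns=cols))

# One concat over the whole list instead of one concat per iteration.
df = pd.concat(frames).unstack('symbol')
print(df.shape)  # (4, 6): one row per step, features x stocks columns
```

Whether this is actually faster for a given data size would need to be measured with timeit in the same way as above.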
The problem is that the two DataFrames have different MultiIndex columns, so nothing is aligned: concat appends the rows of the second DataFrame below the first and fills the other stock's columns with NaN.
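To make the difference visible, the sketch below compares the default row-wise concat with a column-wise concat (axis=1), which aligns on the row index when the dates within each frame are unique. The axis=1 call is an alternative added here for illustration; the variable names rows_appended and dates_aligned are assumptions.

```python
import pandas as pd

A_dt = pd.to_datetime(['2018-01-04', '2018-01-01', '2018-01-05'])
B_dt = pd.to_datetime(['2018-01-01', '2018-01-05', '2018-01-03', '2018-01-02'])
i1_a = pd.MultiIndex.from_tuples([("A", "Open"), ("A", "Close")], names=['symbol', 'series'])
i1_b = pd.MultiIndex.from_tuples([("B", "Open"), ("B", "Close")], names=['symbol', 'series'])
df1_a = pd.DataFrame([(12, 11), (10, 9), (8, 9)], index=A_dt, columns=i1_a)
df1_b = pd.DataFrame([(2, 2), (3, 4), (4, 4), (5, 3)], index=B_dt, columns=i1_b)

# axis=0 (default): rows are appended, so dates appear once per stock.
rows_appended = pd.concat([df1_a, df1_b])

# axis=1: frames are placed side by side and aligned on the date index;
# sort_index() is added because the input dates are unsorted.
dates_aligned = pd.concat([df1_a, df1_b], axis=1).sort_index()

print(rows_appended.shape)  # (7, 4)
print(dates_aligned.shape)  # (5, 4)
```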
One solution is to stack each DataFrame so that the series level moves into the row index, concat the results along axis=1 into a 2-column DataFrame, then unstack, and finally use swaplevel and sort_index to put the column MultiIndex in the correct order:
df = (pd.concat([df1_a.stack(), df1_b.stack()], axis=1)
        .unstack()
        .swaplevel(0,1, axis=1)
        .sort_index(axis=1))
print (df)
series      Close        Open
symbol          A     B     A     B
2018-01-01    9.0   2.0  10.0   2.0
2018-01-02    NaN   3.0   NaN   5.0
2018-01-03    NaN   4.0   NaN   4.0
2018-01-04   11.0   NaN  12.0   NaN
2018-01-05    9.0   4.0   8.0   3.0
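To see what each step of that chain does, the intermediate shapes can be inspected one at a time. This is a sketch on the example data; the names stacked_a, stacked_b, long and wide are assumptions for illustration. Recent pandas versions may emit a FutureWarning for stack(), which does not affect the shapes shown.

```python
import pandas as pd

A_dt = pd.to_datetime(['2018-01-04', '2018-01-01', '2018-01-05'])
B_dt = pd.to_datetime(['2018-01-01', '2018-01-05', '2018-01-03', '2018-01-02'])
i1_a = pd.MultiIndex.from_tuples([("A", "Open"), ("A", "Close")], names=['symbol', 'series'])
i1_b = pd.MultiIndex.from_tuples([("B", "Open"), ("B", "Close")], names=['symbol', 'series'])
df1_a = pd.DataFrame([(12, 11), (10, 9), (8, 9)], index=A_dt, columns=i1_a)
df1_b = pd.DataFrame([(2, 2), (3, 4), (4, 4), (5, 3)], index=B_dt, columns=i1_b)

# stack(): 'series' moves into the row index, a single column 'A' remains.
stacked_a = df1_a.stack()
stacked_b = df1_b.stack()

# concat(axis=1): rows are keyed by (date, series), columns are A and B.
long = pd.concat([stacked_a, stacked_b], axis=1)

# unstack(): 'series' moves back to the columns, beneath 'symbol'.
wide = long.unstack()

print(stacked_a.shape, long.shape, wide.shape)  # (6, 1) (10, 2) (5, 4)
```

Because both stacked frames share the (date, series) row key, the axis=1 concat can align them, which is what the original column layout prevented.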
But it is better to use combine_first:
df = df1_a.combine_first(df1_b)
print (df)
symbol          A           B
series      Close  Open Close  Open
2018-01-01    9.0  10.0   2.0   2.0
2018-01-02    NaN   NaN   3.0   5.0
2018-01-03    NaN   NaN   4.0   4.0
2018-01-04   11.0  12.0   NaN   NaN
2018-01-05    9.0   8.0   4.0   3.0
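For reference, combine_first returns the union of both indexes and columns, keeps the caller's values where they are present, and fills the remaining gaps from the other DataFrame. A minimal sketch with made-up frames (left, right and their values are purely illustrative):

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'x': [1.0, np.nan]}, index=['a', 'b'])
right = pd.DataFrame({'x': [10.0, 20.0], 'y': [30.0, 40.0]}, index=['b', 'c'])

merged = left.combine_first(right)
print(merged)
# Row 'a': x stays 1.0 (from left), y is NaN (missing in both)
# Row 'b': x is 10.0 (left was NaN, filled from right), y is 30.0
# Row 'c': x is 20.0 and y is 40.0 (only present in right)
```

In the stock example the two frames have no columns in common, so combine_first only has to take the union of the dates and columns.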