Naming Pandas Series while stacking from DataFrame
A common workflow I have in pandas is getting data from some numerical function in "wide" form and turning it into a "long" form dataframe for plotting and statistical modeling.
What I mean by wide form is that there is variable information encoded in the columns. For instance, say I measured some value at each of 5 timepoints in 10 different subjects:
import numpy as np
import pandas as pd
wide_df = pd.DataFrame(np.random.randn(10, 5),
                       index=pd.Series(list("abcdefghij"), name="subject"),
                       columns=pd.Series(np.arange(5) * 2, name="timepoint"))
print(wide_df)
timepoint 0 2 4 6 8
subject
a -0.670881 0.959608 -0.480081 0.142092 1.697058
b 2.369493 -0.561081 -0.183635 -0.807523 -0.421347
c -0.908420 0.629171 0.196728 -0.907443 0.264352
d -0.390138 -1.821304 -1.994605 0.225164 0.187649
e -0.860542 -0.998323 -0.490968 -0.815570 -1.009524
f -0.917390 -0.120567 -0.893095 -0.359155 -0.204112
g 0.557500 -1.522631 -1.175746 0.705043 -0.366932
h -0.817043 2.204493 -0.305202 0.464969 0.280027
i -1.137253 0.350984 0.095577 0.468167 -0.058058
j -0.569986 2.438580 -0.514894 0.860504 1.397393
[10 rows x 5 columns]
The quickest way I know to wrangle this into a long-form dataframe is using stack and then reset_index:
long_df = wide_df.stack().reset_index()
print(long_df.head())
subject timepoint 0
0 a 0 -0.670881
1 a 2 0.959608
2 a 4 -0.480081
3 a 6 0.142092
4 a 8 1.697058
[5 rows x 3 columns]
The problem is that my "value" column is now named 0. I could do:
long_series = wide_df.stack()
long_series.name = "value"
long_df = long_series.reset_index()
But that is more typing, requires naming an intermediate variable, and mixes method calls with attribute assignment in a way that really breaks up my flow.
Is there a way to do this in one line? I thought maybe df.stack would take a name argument, but it doesn't, and Series objects don't seem to have a set_name method that I can find.
I do know about pandas.melt, but it seems like overkill in this case of "pure" wide table data, and it drops the subject index, which is important. Is there another answer here?
There is a name argument to Series.reset_index for just this reason:
In [14]: wide_df.stack().reset_index(name='foo')
Out[14]:
subject timepoint foo
0 a 0 -0.179968
1 a 2 1.559283
2 a 4 1.020142
3 a 6 -0.899663
4 a 8 2.983990
5 b 0 0.586476
6 b 2 0.055108
7 b 4 1.834005
8 b 6 1.226371
9 b 8 0.953103
10 c 0 -0.919273
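For anyone who wants to try this without the IPython session, here is a minimal self-contained sketch of the one-liner above. The seeded generator and the variable names long_df and alt_df are my own choices for illustration. The second form, Series.rename with a scalar, is not part of the original answer; it is an alternative that should work on reasonably recent pandas versions, where passing a scalar sets the Series name.

```python
import numpy as np
import pandas as pd

# Rebuild a wide table like the one in the question (seeded for repeatability).
rng = np.random.default_rng(0)
wide_df = pd.DataFrame(
    rng.standard_normal((10, 5)),
    index=pd.Index(list("abcdefghij"), name="subject"),
    columns=pd.Index(np.arange(5) * 2, name="timepoint"),
)

# Approach from the answer: name the value column while resetting the index.
long_df = wide_df.stack().reset_index(name="value")

# Alternative (an addition, not from the answer): rename the stacked Series
# in the method chain, then reset the index.
alt_df = wide_df.stack().rename("value").reset_index()

print(long_df.head())
```

Both forms keep everything in a single method chain, so no intermediate variable or attribute assignment is needed.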
You could also define this yourself if you want (it would be a nice addition to DataFrame):
In [14]: def _melt(self, *args, **kwargs):
....: return pd.melt(self.reset_index(), *args, **kwargs)
....:
In [15]: pd.DataFrame.melt = _melt
In [19]: wide_df.melt('subject',value_name='foo')
Out[19]:
subject timepoint foo
0 a 0 0.374912
1 b 0 -0.016272
2 c 0 -0.510553
3 d 0 -1.532472
4 e 0 -0.115107
5 f 0 -0.101772
6 g 0 -0.020966
7 h 0 0.427469
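The same idea can be sketched without patching DataFrame, by calling pd.melt on the reset frame directly. As far as I know, newer pandas releases (0.20 and later) ship a built-in DataFrame.melt method, so the monkey-patch above may be unnecessary there. In this sketch the seeded data and the name melted are illustrative assumptions, and var_name is passed explicitly rather than relying on the column index name surviving reset_index.

```python
import numpy as np
import pandas as pd

# Rebuild a wide table like the one in the question (seeded for repeatability).
rng = np.random.default_rng(0)
wide_df = pd.DataFrame(
    rng.standard_normal((10, 5)),
    index=pd.Index(list("abcdefghij"), name="subject"),
    columns=pd.Index(np.arange(5) * 2, name="timepoint"),
)

# Move the subject index into a column first so that melt does not drop it,
# then melt every remaining (timepoint) column into long form.
melted = pd.melt(
    wide_df.reset_index(),
    id_vars="subject",
    var_name="timepoint",
    value_name="foo",
)

print(melted.head())
```

Note that melt orders rows column by column (all subjects at timepoint 0 first), whereas stack orders them row by row (all timepoints for subject a first), as the two outputs above show.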