Reshaping a pandas dataframe without a true index
I've been trying to construct a dataframe of regression outputs for the past few weeks, and I've gotten most of the way there. I am now trying to reshape it around certain keywords in a column that, as far as I can tell, is not callable.
A simplified version of my dataframe looks like:
? coef pval se rsq
Intercept 1 0 .1 .1
Cash 2 0.2 .05 .1
Food 2 0.05 .2 .1
Intercept 3 0 .1 .2
Cash 1 0.01 .2 .2
Food 2 0.3 .1 .2
Zone 1 0.4 .3 .2
What I'm trying to achieve is:
(1) (2)
Intercept coef 1 3
Intercept pval 0 0
Intercept se 0.1 0.1
Cash coef 2 1
Cash pval 0.2 0.01
Cash se 0.05 0.2
Food coef 2 2
Food pval 0.05 0.3
Food se 0.2 0.1
Zone coef NaN 1
Zone pval NaN 0.4
Zone se NaN 0.3
rsq 0.1 0.2
So far I've tried several approaches, the most promising being a reshape using r-squared (rsq) as an index:
RegTable = RegTable.pivot(index='rsq', columns=['pval', 'coef', 'robust_se'])
This, however, returns the error ValueError: all arrays must be same length. Some research makes me think this is because, as of right now, zone = NaN is represented by the regression simply not having a Zone row, but I'm not sure how to fix it. In addition, I've been unable to figure out how to call the column I identified as "?" using pandas - it's not labeled in the CSV output. This approach also seems problematic because, in the off chance that two regressions have the same r-squared, it will either throw a new error or average each value, neither of which is desirable.
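As for the unlabeled "?" column: it is just the unnamed index column of the CSV, so it can be read back in with index_col=0 and given a name. A minimal sketch (the inline CSV below is a stand-in for the real output file):

```python
import io
import pandas as pd

# Stand-in for the unlabeled CSV output (the first column has no header).
csv_text = """,coef,pval,se,rsq
Intercept,1,0,.1,.1
Cash,2,0.2,.05,.1
Food,2,0.05,.2,.1
Intercept,3,0,.1,.2
Cash,1,0.01,.2,.2
Food,2,0.3,.1,.2
Zone,1,0.4,.3,.2
"""

# index_col=0 reads the unnamed first column as the index;
# rename_axis gives it a usable name, and reset_index turns it
# into an ordinary column that can be referenced by name.
df = pd.read_csv(io.StringIO(csv_text), index_col=0).rename_axis('term')
df = df.reset_index()
```

Here 'term' is just a placeholder name; any label works once the column is materialized.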
Let's try this:
df.set_index(['rsq', '?']).stack().unstack([0]).T\
  .reset_index().T.rename_axis([None, None])\
  .rename(columns={0: '(1)', 1: '(2)'})\
  .sort_index()
Where df:
? coef pval se rsq
0 Intercept 1 0.00 0.10 0.1
1 Cash 2 0.20 0.05 0.1
2 Food 2 0.05 0.20 0.1
3 Intercept 3 0.00 0.10 0.2
4 Cash 1 0.01 0.20 0.2
5 Food 2 0.30 0.10 0.2
6 Zone 1 0.40 0.30 0.2
Output:
(1) (2)
Cash coef 2.00 1.00
pval 0.20 0.01
se 0.05 0.20
Food coef 2.00 2.00
pval 0.05 0.30
se 0.20 0.10
Intercept coef 1.00 3.00
pval 0.00 0.00
se 0.10 0.10
Zone coef NaN 1.00
pval NaN 0.40
se NaN 0.30
rsq 0.10 0.20
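For reference, the whole chain can be reproduced end-to-end from the sample data in the question (the '(1)'/'(2)' labels come from the rename mapping):

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    '?':    ['Intercept', 'Cash', 'Food', 'Intercept', 'Cash', 'Food', 'Zone'],
    'coef': [1, 2, 2, 3, 1, 2, 1],
    'pval': [0, 0.2, 0.05, 0, 0.01, 0.3, 0.4],
    'se':   [0.1, 0.05, 0.2, 0.1, 0.2, 0.1, 0.3],
    'rsq':  [0.1, 0.1, 0.1, 0.2, 0.2, 0.2, 0.2],
})

out = (df.set_index(['rsq', '?'])      # (rsq, term) pairs become the index
         .stack()                      # long format: (rsq, term, measure) -> value
         .unstack([0])                 # rsq values move out to the columns
         .T.reset_index().T            # fold the rsq labels back in as a row
         .rename_axis([None, None])
         .rename(columns={0: '(1)', 1: '(2)'})
         .sort_index())
```

Missing (rsq, term) pairs, like Zone in the first regression, simply come out as NaN after the unstack.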
The intuition behind this solution is to slice the dataframe into two roughly equal parts. The assumption here is that you only have two sets of data points, so this becomes manageable.
print(df)
coef pval se rsq
Intercept 1 0.00 0.10 0.1
Cash 2 0.20 0.05 0.1
Food 2 0.05 0.20 0.1
Intercept 3 0.00 0.10 0.2
Cash 1 0.01 0.20 0.2
Food 2 0.30 0.10 0.2
Zone 1 0.40 0.30 0.2
df_ = df.reset_index().iloc[:, :-1]
df2 = df_.iloc[df_['index'].drop_duplicates(keep='first').to_frame().index]
df1 = df_.iloc[df_['index'].drop_duplicates(keep='last')
                  .to_frame().index.difference(df2.index)]
Once this is done, each piece must be stacked and then concatenated along the first axis.
out = pd.concat([df1.set_index('index').stack(),
                 df2.set_index('index').stack()], axis=1)
out = out.append(pd.DataFrame([df.rsq.unique()], index=[('rsq', '')]))
out.columns = ['1', '2']
print(out)
1 2
index
Cash coef 1.00 2.00
pval 0.01 0.20
se 0.20 0.05
Food coef 2.00 2.00
pval 0.30 0.05
se 0.10 0.20
Intercept coef 3.00 1.00
pval 0.00 0.00
se 0.10 0.10
Zone coef NaN 1.00
pval NaN 0.40
se NaN 0.30
rsq 0.10 0.20
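Note that DataFrame.append was removed in pandas 2.0, so on current versions the rsq row has to be attached with pd.concat instead. A self-contained sketch of the same approach (sample data reconstructed from the question; note that, as in the output above, the two columns come out in the opposite order from the desired table):

```python
import pandas as pd

df = pd.DataFrame(
    {'coef': [1, 2, 2, 3, 1, 2, 1],
     'pval': [0, 0.2, 0.05, 0, 0.01, 0.3, 0.4],
     'se':   [0.1, 0.05, 0.2, 0.1, 0.2, 0.1, 0.3],
     'rsq':  [0.1, 0.1, 0.1, 0.2, 0.2, 0.2, 0.2]},
    index=['Intercept', 'Cash', 'Food', 'Intercept', 'Cash', 'Food', 'Zone'])

df_ = df.reset_index().iloc[:, :-1]   # drop rsq, keep the term labels
# First occurrence of each term goes to one piece, the rest to the other.
df2 = df_.iloc[df_['index'].drop_duplicates(keep='first').to_frame().index]
df1 = df_.iloc[df_['index'].drop_duplicates(keep='last')
                  .to_frame().index.difference(df2.index)]

out = pd.concat([df1.set_index('index').stack(),
                 df2.set_index('index').stack()], axis=1)
# pd.concat replaces the removed DataFrame.append for the rsq row.
rsq_row = pd.DataFrame([df['rsq'].unique()],
                       index=pd.MultiIndex.from_tuples([('rsq', '')]))
out = pd.concat([out, rsq_row])
out.columns = ['1', '2']
```

Building the rsq row with a proper MultiIndex keeps the concatenated index consistent with the stacked pieces.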
Here's a slightly eccentric way to do this without splitting up into two data frames.
This solution renames the indices to keep track of the regression they belong to, adding in NaN when there's a missing field (as is the case for Zone). Then groupby, stack, and split the column of lists into (1) and (2) columns (although it's generalized to handle as many regressions as occur in the data).
With df as:
coef pval se rsq
Intercept 1 0 .1 .1
Cash 2 0.2 .05 .1
Food 2 0.05 .2 .1
Intercept 3 0 .1 .2
Cash 1 0.01 .2 .2
Food 2 0.3 .1 .2
Zone 1 0.4 .3 .2
Rename index values as Intercept0, Intercept1, etc.:
measures = df.index.unique()
found = {m: 0 for m in measures}
for i, name in enumerate(df.index):
    # If some other measure is already ahead, this regression skipped `name`,
    # so insert a NaN placeholder row for the missing occurrence first.
    if np.max(list(found.values())) > found[name] + 1:
        df.loc["{}{}".format(name, found[name])] = np.nan
        found[name] += 1
    df.index.values[i] = "{}{}".format(name, found[name])
    found[name] += 1
df
coef pval se rsq
Intercept0 1.0 0.00 0.10 0.1
Cash0 2.0 0.20 0.05 0.1
Food0 2.0 0.05 0.20 0.1
Intercept1 3.0 0.00 0.10 0.2
Cash1 1.0 0.01 0.20 0.2
Food1 2.0 0.30 0.10 0.2
Zone1 1.0 0.40 0.30 0.2
Zone0 NaN NaN NaN NaN
Now arrange rows so that elements from each regression are grouped together. (This is mainly necessary to get the NaN rows in the right spot.)
order_by_reg = sorted(df.index, key=lambda x: ''.join(reversed(x)))
df = df.loc[order_by_reg]
df
coef pval se rsq
Food0 2.0 0.05 0.20 0.1
Zone0 NaN NaN NaN NaN
Cash0 2.0 0.20 0.05 0.1
Intercept0 1.0 0.00 0.10 0.1
Food1 2.0 0.30 0.10 0.2
Zone1 1.0 0.40 0.30 0.2
Cash1 1.0 0.01 0.20 0.2
Intercept1 3.0 0.00 0.10 0.2
Finally, groupby, stack, and split the resulting column of lists with apply(pd.Series):
gb = (df.groupby(lambda x: x[:-1])
        .agg(lambda x: list(x))
        .stack()
        .apply(lambda pair: pd.Series({"({})".format(i): el
                                       for i, el in enumerate(pair)})))
gb
(0) (1)
Cash coef 2.00 1.00
pval 0.20 0.01
se 0.05 0.20
rsq 0.10 0.20
Food coef 2.00 2.00
pval 0.05 0.30
se 0.20 0.10
rsq 0.10 0.20
Intercept coef 1.00 3.00
pval 0.00 0.00
se 0.10 0.10
rsq 0.10 0.20
Zone coef NaN 1.00
pval NaN 0.40
se NaN 0.30
rsq NaN 0.20
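The bookkeeping loop above can also be avoided by deriving the regression number directly from rsq with groupby(...).ngroup(), since each distinct rsq value identifies one regression; unstack then produces the NaN for the missing Zone entry automatically. A sketch of that variant (the rsq row itself is omitted here):

```python
import pandas as pd

# Sample data from the question, with the terms as a (non-unique) index.
df = pd.DataFrame(
    {'coef': [1, 2, 2, 3, 1, 2, 1],
     'pval': [0, 0.2, 0.05, 0, 0.01, 0.3, 0.4],
     'se':   [0.1, 0.05, 0.2, 0.1, 0.2, 0.1, 0.3],
     'rsq':  [0.1, 0.1, 0.1, 0.2, 0.2, 0.2, 0.2]},
    index=['Intercept', 'Cash', 'Food', 'Intercept', 'Cash', 'Food', 'Zone'])

# ngroup() numbers the regressions 0, 1, ... by their rsq value.
df['reg'] = df.groupby('rsq').ngroup().to_numpy()

out = (df.drop(columns='rsq')
         .set_index('reg', append=True)  # index becomes (term, regression number)
         .stack()                        # long: (term, reg, measure) -> value
         .unstack('reg')                 # regressions fan out to the columns
         .rename(columns={0: '(1)', 1: '(2)'}))
```

Because Zone never appears for reg 0, its (1) column is NaN without any explicit placeholder rows.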