Reshaping a pandas dataframe without a true index

I've been trying to construct a dataframe of regression outputs for the past few weeks, and I've gotten most of the way there. I am now trying to reshape it around the keywords in a column that, as far as I can tell, cannot be referenced by name.

A simplified version of my dataframe looks like:

?           coef pval  se   rsq
Intercept   1    0     .1   .1
Cash        2    0.2   .05  .1
Food        2    0.05  .2   .1
Intercept   3    0     .1   .2
Cash        1    0.01  .2   .2
Food        2    0.3   .1   .2
Zone        1    0.4   .3   .2

What I'm trying to achieve is:

                (1)      (2)
Intercept coef   1        3
Intercept pval   0        0
Intercept se     0.1      0.1
Cash coef        2        1
Cash pval        0.2      0.01
Cash se          0.05     0.2
Food coef        2        2
Food pval        0.05     0.3
Food se          0.2      0.1
Zone coef        NaN      1
Zone pval        NaN      0.4
Zone se          NaN      0.3
rsq              0.1      0.2

So far I've tried several approaches, the most promising being a reshape that uses r-squared (rsq) as the index: RegTable = RegTable.pivot(index='rsq', columns=['pval', 'coef', 'robust_se'])

This, however, returns the error ValueError: all arrays must be same length. Some research makes me think this is because, as of right now, zone = NaN is represented by the regression simply not having a Zone row, but I'm not sure how to fix it. In addition, I've been unable to figure out how to refer to the column I identified as "?" using pandas, since it is not labeled in the CSV output. This approach also seems problematic because, on the off chance that two regressions have the same r-squared, it will either throw a new error or average the values, neither of which is desirable.
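As a minimal sketch of the two obstacles described above (the unlabeled column and the risk of identical r-squared values), the following assumes the CSV has an empty first header cell and that every regression block begins with an Intercept row. The names csv_text, term, and reg are illustrative only and do not come from the original data.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for the real regression output;
# the first header cell is empty, as described in the question.
csv_text = """,coef,pval,se,rsq
Intercept,1,0,0.1,0.1
Cash,2,0.2,0.05,0.1
Food,2,0.05,0.2,0.1
Intercept,3,0,0.1,0.2
Cash,1,0.01,0.2,0.2
Food,2,0.3,0.1,0.2
Zone,1,0.4,0.3,0.2
"""

df = pd.read_csv(io.StringIO(csv_text))
# pandas names a blank header cell "Unnamed: 0"; give it a usable name.
df = df.rename(columns={"Unnamed: 0": "term"})

# Assumption: every regression block starts with an "Intercept" row,
# so a running count of those rows numbers the regressions 1, 2, ...
df["reg"] = (df["term"] == "Intercept").cumsum()
print(df)
```

Keying the reshape on reg rather than rsq means two regressions with the same r-squared would still be kept apart.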

Let's try this:

df.set_index(['rsq','?']).stack().unstack([0]).T\
  .reset_index().T.rename_axis([None,None])\
  .rename(columns={0:'(1)',1:'(2)'})\
  .sort_index()

Where df is:

           ?  coef  pval    se  rsq
0  Intercept     1  0.00  0.10  0.1
1       Cash     2  0.20  0.05  0.1
2       Food     2  0.05  0.20  0.1
3  Intercept     3  0.00  0.10  0.2
4       Cash     1  0.01  0.20  0.2
5       Food     2  0.30  0.10  0.2
6       Zone     1  0.40  0.30  0.2

Output:

                 (1)   (2)
Cash      coef  2.00  1.00
          pval  0.20  0.01
          se    0.05  0.20
Food      coef  2.00  2.00
          pval  0.05  0.30
          se    0.20  0.10
Intercept coef  1.00  3.00
          pval  0.00  0.00
          se    0.10  0.10
Zone      coef   NaN  1.00
          pval   NaN  0.40
          se     NaN  0.30
rsq             0.10  0.20
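A possible variant of the same reshape, sketched here under the assumption that each regression starts with an Intercept row, numbers the regressions explicitly instead of keying on rsq. It uses melt and unstack rather than the chain above, and the column names term, stat, and reg are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "term": ["Intercept", "Cash", "Food", "Intercept", "Cash", "Food", "Zone"],
    "coef": [1, 2, 2, 3, 1, 2, 1],
    "pval": [0, 0.2, 0.05, 0, 0.01, 0.3, 0.4],
    "se":   [0.1, 0.05, 0.2, 0.1, 0.2, 0.1, 0.3],
    "rsq":  [0.1, 0.1, 0.1, 0.2, 0.2, 0.2, 0.2],
})
# Assumption: an "Intercept" row marks the start of each regression.
df["reg"] = (df["term"] == "Intercept").cumsum()

# Long format: one row per (regression, term, statistic).
long = df.melt(id_vars=["reg", "term"],
               value_vars=["coef", "pval", "se"],
               var_name="stat")

# One column per regression; terms missing from a regression become NaN.
wide = long.set_index(["term", "stat", "reg"])["value"].unstack("reg")

# Add one r-squared row, taken from the first row of each regression.
rsq_row = df.groupby("reg")["rsq"].first().to_frame().T
rsq_row.index = pd.MultiIndex.from_tuples([("rsq", "")], names=["term", "stat"])
wide = pd.concat([wide, rsq_row])

wide.columns = ["({})".format(c) for c in wide.columns]
print(wide)
```

Because the columns come from reg, this should extend to more than two regressions without changes.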

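The intuition behind this solution is to slice the dataframe into two roughly equal parts. The assumption here is that you only have two sets of data points, so this becomes manageable. Note that the split below is by first versus last occurrence of each label rather than by regression, so a label that occurs only once (Zone here) ends up in the first-occurrence piece.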

print(df)

           coef  pval    se  rsq
Intercept     1  0.00  0.10  0.1
Cash          2  0.20  0.05  0.1
Food          2  0.05  0.20  0.1
Intercept     3  0.00  0.10  0.2
Cash          1  0.01  0.20  0.2
Food          2  0.30  0.10  0.2
Zone          1  0.40  0.30  0.2


df_ = df.reset_index().iloc[:, :-1]

df2 = df_.iloc[df_['index'].drop_duplicates(keep='first').to_frame().index]
df1 = df_.iloc[df_['index'].drop_duplicates(keep='last')\
                 .to_frame().index.difference(df2.index)]

Once this is done, each piece must be stacked and the two pieces concatenated side by side (axis 1).

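out = pd.concat([df1.set_index('index').stack(),
                 df2.set_index('index').stack()], axis=1)
# add the r-squared row; the result has to be assigned back to out
rsq_row = pd.DataFrame([df.rsq.unique()],
                       index=pd.MultiIndex.from_tuples([('rsq', '')]))
out = pd.concat([out, rsq_row])
out.columns = ['1', '2']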

print(out) 

                   1     2
index                     
Cash      coef  1.00  2.00
          pval  0.01  0.20
          se    0.20  0.05
Food      coef  2.00  2.00
          pval  0.30  0.05
          se    0.10  0.20
Intercept coef  3.00  1.00
          pval  0.00  0.00
          se    0.10  0.10
Zone      coef   NaN  1.00
          pval   NaN  0.40
          se     NaN  0.30
rsq             0.10  0.20
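The same split, stack, and concatenate idea can be sketched with the pieces cut at regression boundaries instead of at first and last occurrences. This is only an illustration: it assumes each regression begins with an Intercept row, and the names term, reg, and pieces are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "term": ["Intercept", "Cash", "Food", "Intercept", "Cash", "Food", "Zone"],
    "coef": [1, 2, 2, 3, 1, 2, 1],
    "pval": [0, 0.2, 0.05, 0, 0.01, 0.3, 0.4],
    "se":   [0.1, 0.05, 0.2, 0.1, 0.2, 0.1, 0.3],
    "rsq":  [0.1, 0.1, 0.1, 0.2, 0.2, 0.2, 0.2],
})
# Assumption: an "Intercept" row marks the start of each regression.
df["reg"] = (df["term"] == "Intercept").cumsum()

stats = ["coef", "pval", "se"]

# One stacked Series per regression, keyed by its column label.
pieces = {
    "({})".format(reg): block.set_index("term")[stats].stack()
    for reg, block in df.groupby("reg")
}

# Concatenating along axis=1 aligns the pieces on (term, statistic);
# a term missing from one regression shows up as NaN in that column.
out = pd.concat(pieces, axis=1)
print(out)
```

With this split, Zone lands in the column of the regression that actually contains it.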

Here's a slightly eccentric way to do this without splitting up into two data frames.

This solution renames the index labels to keep track of the regression they belong to, adding a NaN row when a field is missing (as is the case for Zone).
Then groupby, stack, and split the column of lists into (1) and (2) columns (although it generalizes to as many regressions as occur in the data).

With df as:

            coef pval  se   rsq
Intercept   1    0     .1   .1
Cash        2    0.2   .05  .1
Food        2    0.05  .2   .1
Intercept   3    0     .1   .2
Cash        1    0.01  .2   .2
Food        2    0.3   .1   .2
Zone        1    0.4   .3   .2

Rename index values as Intercept0, Intercept1, etc.:

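import numpy as np
import pandas as pd

measures = df.index.unique()
found = {m: 0 for m in measures}  # occurrences of each label seen so far

for i, name in enumerate(df.index):
    # a label that lags behind the others was absent from an earlier
    # regression: add an all-NaN placeholder row for it
    if np.max(list(found.values())) > found[name]+1:
        df.loc["{}{}".format(name, found[name])] = np.nan
        found[name] += 1
    # suffix the label with the number of the regression it belongs to
    df.index.values[i] = "{}{}".format(name, found[name])
    found[name] += 1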

df
            coef  pval    se  rsq
Intercept0   1.0  0.00  0.10  0.1
Cash0        2.0  0.20  0.05  0.1
Food0        2.0  0.05  0.20  0.1
Intercept1   3.0  0.00  0.10  0.2
Cash1        1.0  0.01  0.20  0.2
Food1        2.0  0.30  0.10  0.2
Zone1        1.0  0.40  0.30  0.2
Zone0        NaN   NaN   NaN  NaN

Now arrange rows so that elements from each regression are grouped together. (This is mainly necessary to get the NaN rows in the right spot.)

order_by_reg = sorted(df.index, key=lambda x: ''.join(reversed(x)))
df = df.loc[order_by_reg]

df
            coef  pval    se  rsq
Food0        2.0  0.05  0.20  0.1
Zone0        NaN   NaN   NaN  NaN
Cash0        2.0  0.20  0.05  0.1
Intercept0   1.0  0.00  0.10  0.1
Food1        2.0  0.30  0.10  0.2
Zone1        1.0  0.40  0.30  0.2
Cash1        1.0  0.01  0.20  0.2
Intercept1   3.0  0.00  0.10  0.2
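The reversed-string key above works because every suffix is a single digit; a label such as Cash10 reverses to "01hsaC" and would be misplaced next to the regression 0 rows. As a hedged alternative, the sketch below parses the numeric suffix explicitly. The helper reg_key is hypothetical and not part of the original answer.

```python
import re

def reg_key(label):
    """Sort key: (regression number, term name) parsed from e.g. 'Cash10'."""
    match = re.fullmatch(r"(.*?)(\d+)", label)
    return int(match.group(2)), match.group(1)

labels = ["Intercept0", "Cash0", "Food0",
          "Intercept1", "Cash1", "Food1", "Zone1", "Zone0"]

# Rows are grouped by regression number, then alphabetically by term.
ordered = sorted(labels, key=reg_key)
print(ordered)
```

The result could then be passed to df.loc in the same way as order_by_reg; the order of terms within a regression differs from the output above, which should not matter as long as it is the same for every regression.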

Finally, groupby, stack, and split the resulting column of lists with apply(pd.Series):

gb = (df.groupby(lambda x: x[:-1])
        .agg(lambda x: list(x))
        .stack()
        .apply(lambda pair: pd.Series({"({})".format(i):el for i, el in enumerate(pair)})))

gb
                 (0)   (1)
Cash      coef  2.00  1.00
          pval  0.20  0.01
          se    0.05  0.20
          rsq   0.10  0.20
Food      coef  2.00  2.00
          pval  0.05  0.30
          se    0.20  0.10
          rsq   0.10  0.20
Intercept coef  1.00  3.00
          pval  0.00  0.00
          se    0.10  0.10
          rsq   0.10  0.20
Zone      coef   NaN  1.00
          pval   NaN  0.40
          se     NaN  0.30
          rsq    NaN  0.20
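For the last step, splitting a column of lists into separate columns, a commonly used alternative to apply(pd.Series) is to build a DataFrame directly from the lists. The sketch below uses a small hand-made Series in place of the grouped result and assumes every list has the same length; the name lists is illustrative.

```python
import pandas as pd

# Stand-in for the stacked groupby result: one list per (term, statistic).
lists = pd.Series(
    [[2.0, 1.0], [0.2, 0.01], [float("nan"), 1.0]],
    index=pd.MultiIndex.from_tuples(
        [("Cash", "coef"), ("Cash", "pval"), ("Zone", "coef")]
    ),
)

# Each list becomes a row; list positions become integer columns 0, 1, ...
wide = pd.DataFrame(lists.tolist(), index=lists.index)

# Label the columns (1), (2), ... to match the layout asked for.
wide.columns = ["({})".format(i + 1) for i in wide.columns]
print(wide)
```

Numbering from 1 here is a choice made to match the desired table in the question; the answer's own output numbers the columns from 0.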
