Lengthening a DataFrame by stacking columns in Pandas

Question

I am looking for a function that achieves the following. It is best shown with an example. Consider:

df = pd.DataFrame([[1, 2, 3], [4, 5, np.nan]], columns=['x', 'y1', 'y2'])

which looks like:

   x  y1   y2
0  1   2  3.0
1  4   5  NaN

I would like to collapse the y1 and y2 columns, lengthening the DataFrame where necessary, so that the output is:

   x    y
0  1  2.0
1  1  3.0
2  4  5.0

That is, one row for each combination of either x and y1, or x and y2. I am looking for a function that does this relatively efficiently, as I have multiple y columns and many rows.

Answer 1

You can use stack to do this, i.e.:

pd.DataFrame(df.set_index('x').stack().reset_index(level=0).values,columns=['x','y'])

     x    y
0  1.0  2.0
1  1.0  3.0
2  4.0  5.0
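
The one-liner above goes through .values, which is why x comes back as a float. A minimal sketch of a variant that keeps x as an integer is shown here, assuming the two-row frame from the question. The explicit dropna() is an addition for this sketch, since whether stack() drops missing values by itself depends on the pandas version and its future_stack setting.

```python
import numpy as np
import pandas as pd

# Sample frame from the question
df = pd.DataFrame([[1, 2, 3], [4, 5, np.nan]], columns=['x', 'y1', 'y2'])

stacked = (
    df.set_index('x')                    # keep x as the row label
      .stack()                           # one entry per (x, y column) pair
      .dropna()                          # explicit, in case stack() keeps NaN
      .reset_index(level=1, drop=True)   # discard the 'y1'/'y2' labels
      .rename('y')                       # name the value column
      .reset_index()                     # turn x back into a regular column
)
print(stacked)
```

Here x keeps its integer dtype, while y is float because the original y2 column contains NaN.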

Answer 2

Repeat each item in the first column according to the number of non-null values in its row, then build the final dataframe from the non-null values in the remaining columns. You can use DataFrame.count() to count the non-null values and numpy.repeat() to repeat the items of an array according to a matching array of counts.

>>> rest = df.loc[:,'y1':]
>>> pd.DataFrame({'x': np.repeat(df['x'], rest.count(1)).values,
                  'y': rest.values[rest.notna()]})

Demo:

>>> df
    x   y1   y2   y3   y4
0   1  2.0  3.0  NaN  6.0
1   4  5.0  NaN  9.0  3.0
2  10  NaN  NaN  NaN  NaN
3   9  NaN  NaN  6.0  NaN
4   7  6.0  NaN  NaN  NaN

>>> rest = df.loc[:,'y1':]
>>> pd.DataFrame({'x': np.repeat(df['x'], rest.count(1)).values,
                  'y': rest.values[rest.notna()]})
   x    y
0  1  2.0
1  1  3.0
2  1  6.0
3  4  5.0
4  4  9.0
5  4  3.0
6  9  6.0
7  7  6.0
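
To make the count-and-repeat idea easier to follow, here is a minimal sketch that spells out each intermediate value on the two-row frame from the question. The variable names (counts, x_rep, mask, y_flat) are illustrative only and are not part of the answer above.

```python
import numpy as np
import pandas as pd

# Sample frame from the question
df = pd.DataFrame([[1, 2, 3], [4, 5, np.nan]], columns=['x', 'y1', 'y2'])

rest = df.loc[:, 'y1':]                    # every column from y1 onwards
counts = rest.count(axis=1).to_numpy()     # non-null values per row
x_rep = np.repeat(df['x'].to_numpy(), counts)  # each x once per non-null y
mask = rest.notna().to_numpy()             # True where a y value exists
y_flat = rest.to_numpy()[mask]             # non-null y values, row by row

out = pd.DataFrame({'x': x_rep, 'y': y_flat})
print(out)
```

Because the boolean mask is applied in row-major order, the flattened y values line up with the repeated x values.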

Answer 3

Here's one based on NumPy, since you were looking for performance:

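def gather_columns(df):
    # Select every column whose name starts with 'y'
    col_mask = [i.startswith('y') for i in df.columns]
    ally_vals = df.iloc[:,col_mask].values
    # True where a y value is present (not NaN)
    y_valid_mask = ~np.isnan(ally_vals)

    # Repeat each x once per valid y value in its row
    reps = np.count_nonzero(y_valid_mask, axis=1)
    x_vals = np.repeat(df.x.values, reps)
    # Boolean indexing flattens the valid y values in row-major order
    y_vals = ally_vals[y_valid_mask]
    return pd.DataFrame({'x':x_vals, 'y':y_vals})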

Sample run:

In [78]: df #(added more cols for variety)
Out[78]: 
   x  y1   y2   y5   y7
0  1   2  3.0  NaN  NaN
1  4   5  NaN  6.0  7.0

In [79]: gather_columns(df)
Out[79]: 
   x    y
0  1  2.0
1  1  3.0
2  4  5.0
3  4  6.0
4  4  7.0

If the y columns always run from the second column through to the last, we can simply slice the dataframe and get a further performance boost, like so:

def gather_columns_v2(df):
    ally_vals = df.iloc[:,1:].values
    y_valid_mask = ~np.isnan(ally_vals)

    reps = np.count_nonzero(y_valid_mask, axis=1)
    x_vals = np.repeat(df.x.values, reps)
    y_vals = ally_vals[y_valid_mask]
    return pd.DataFrame({'x':x_vals, 'y':y_vals})
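
Both functions above rely on np.isnan, which only accepts numeric arrays. If the y columns might hold strings or other objects, a possible variation is sketched here using pd.isna instead. The function name gather_columns_any_dtype and its x_col parameter are hypothetical and introduced only for this sketch.

```python
import numpy as np
import pandas as pd

def gather_columns_any_dtype(df, x_col='x'):
    # Hypothetical helper: same idea as gather_columns_v2, but pd.isna
    # also handles object arrays (None / NaN among strings).
    y_vals_2d = df.drop(columns=x_col).to_numpy()
    valid = ~pd.isna(y_vals_2d)            # True where a value is present
    reps = valid.sum(axis=1)               # valid values per row
    return pd.DataFrame({
        'x': np.repeat(df[x_col].to_numpy(), reps),
        'y': y_vals_2d[valid],             # flattened in row-major order
    })

df = pd.DataFrame({'x': [1, 4], 'y1': ['a', 'b'], 'y2': ['c', None]})
out = gather_columns_any_dtype(df)
print(out)
```

This trades a little of the raw NumPy speed for generality, so for purely numeric data the versions above are likely still the better fit.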

Answer 1: 3, 2018-05-23 07:16:04
Answer 2: 2, 2018-05-23 07:09:50
Answer 3: 1, accepted, 2018-05-23 06:58:01
