
Lengthening a DataFrame based on stacking columns within it in Pandas

I am looking for a function that achieves the following. It is best shown with an example. Consider:

df = pd.DataFrame([[1, 2, 3], [4, 5, np.nan]], columns=['x', 'y1', 'y2'])

which looks like:

   x  y1   y2
0  1   2  3.0
1  4   5  NaN

I would like to collapse the y1 and y2 columns, lengthening the DataFrame where necessary, so that the output is:

   x  y
0  1  2
1  1  3
2  4  5

That is, one row for each combination of x and y1, or x and y2, skipping missing values. I am looking for a function that does this relatively efficiently, as I have multiple y columns and many rows.

You can use stack to get this done, i.e.

pd.DataFrame(df.set_index('x').stack().reset_index(level=0).values,columns=['x','y'])

     x    y
0  1.0  2.0
1  1.0  3.0
2  4.0  5.0
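
As a rough sketch of what the one-liner above does, the same chain can be unrolled step by step. The names `stacked` and `out` are only illustrative, and the explicit `dropna()` is an added assumption so that the result does not depend on whether the installed pandas version's `stack()` drops missing values by default. Because this variant does not go through `.values`, x keeps its integer dtype instead of being upcast to float.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, np.nan]], columns=['x', 'y1', 'y2'])

# Move 'x' into the index so that only the y columns get stacked.
stacked = df.set_index('x').stack()

# Drop missing values explicitly (assumption: we never want NaN rows).
stacked = stacked.dropna()

# Name the values 'y', bring 'x' back as a column,
# and discard the leftover y1/y2 labels from the index.
out = stacked.rename('y').reset_index(level='x').reset_index(drop=True)
print(out)
```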

Repeat each item in the first column according to the number of non-null values in its row. Then build the final DataFrame from the non-null values in the other columns. You can use the DataFrame.count() method to count non-null values and numpy.repeat() to repeat an array according to a matching array of counts.

>>> rest = df.loc[:,'y1':]
>>> pd.DataFrame({'x': np.repeat(df['x'], rest.count(1)).values,
                  'y': rest.values[rest.notna()]})

Demo:

>>> df
    x   y1   y2   y3   y4
0   1  2.0  3.0  NaN  6.0
1   4  5.0  NaN  9.0  3.0
2  10  NaN  NaN  NaN  NaN
3   9  NaN  NaN  6.0  NaN
4   7  6.0  NaN  NaN  NaN

>>> rest = df.loc[:,'y1':]
>>> pd.DataFrame({'x': np.repeat(df['x'], rest.count(1)).values,
                  'y': rest.values[rest.notna()]})
   x    y
0  1  2.0
1  1  3.0
2  1  6.0
3  4  5.0
4  4  9.0
5  4  3.0
6  9  6.0
7  7  6.0
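
The approach above relies on the repeated x values lining up with the masked y values. A small sketch of why that holds, using plain NumPy arrays with made-up numbers (`vals`, `x`, and `pairs` are illustrative names, not part of the answer): boolean indexing on a 2-D array returns the selected elements row by row, which is the same order that `np.repeat` produces when given per-row counts.

```python
import numpy as np

vals = np.array([[2.0, 3.0, np.nan],
                 [5.0, np.nan, 9.0]])
x = np.array([1, 4])

mask = ~np.isnan(vals)        # True where a value is present
counts = mask.sum(axis=1)     # number of non-null values in each row

# Boolean indexing flattens row by row, so values stay grouped by row.
y_flat = vals[mask]
# Repeating each x by its row's count follows that same row order.
x_rep = np.repeat(x, counts)

pairs = list(zip(x_rep.tolist(), y_flat.tolist()))
print(pairs)
```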

Here's one based on NumPy, since you were looking for performance:

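def gather_columns(df):
    # Select every column whose name starts with 'y'
    col_mask = [i.startswith('y') for i in df.columns]
    ally_vals = df.iloc[:,col_mask].values
    # True where a y value is present (not NaN)
    y_valid_mask = ~np.isnan(ally_vals)

    # Repeat each x once per valid y value in its row
    reps = np.count_nonzero(y_valid_mask, axis=1)
    x_vals = np.repeat(df.x.values, reps)
    # Boolean indexing flattens row by row, matching the repeated x values
    y_vals = ally_vals[y_valid_mask]
    return pd.DataFrame({'x':x_vals, 'y':y_vals})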

Sample run:

In [78]: df #(added more cols for variety)
Out[78]: 
   x  y1   y2   y5   y7
0  1   2  3.0  NaN  NaN
1  4   5  NaN  6.0  7.0

In [79]: gather_columns(df)
Out[79]: 
   x    y
0  1  2.0
1  1  3.0
2  4  5.0
3  4  6.0
4  4  7.0

If the y columns always run from the second column through to the last, we can simply slice the DataFrame by position and skip the name check, which should give a further performance boost:

def gather_columns_v2(df):
    # Assumes every column after the first is a y column
    ally_vals = df.iloc[:,1:].values
    y_valid_mask = ~np.isnan(ally_vals)

    # Repeat each x once per valid y value in its row
    reps = np.count_nonzero(y_valid_mask, axis=1)
    x_vals = np.repeat(df.x.values, reps)
    y_vals = ally_vals[y_valid_mask]
    return pd.DataFrame({'x':x_vals, 'y':y_vals})
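
To check the performance claim on your own data, something like the following sketch may help. It is not a benchmark from the answers: `gather_numpy` and `gather_stack` are hypothetical helper names (the first mirrors `gather_columns_v2`, the second the stack-based answer), and the random test frame, its size, and the share of missing values are arbitrary assumptions. It first confirms that both approaches return the same frame, then prints timings, which will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

def gather_numpy(df):
    # Mirrors gather_columns_v2: every column after the first is a y column.
    vals = df.iloc[:, 1:].to_numpy(dtype=float)
    mask = ~np.isnan(vals)
    return pd.DataFrame({'x': np.repeat(df['x'].to_numpy(), mask.sum(axis=1)),
                         'y': vals[mask]})

def gather_stack(df):
    # Stack-based variant; dropna() makes NaN handling explicit.
    s = df.set_index('x').stack().dropna()
    return pd.DataFrame({'x': s.index.get_level_values('x').to_numpy(),
                         'y': s.to_numpy()})

# Arbitrary test data: 1000 rows, 5 y columns, roughly 40% missing.
rng = np.random.default_rng(0)
data = rng.random((1000, 5))
data[rng.random(data.shape) < 0.4] = np.nan
df = pd.DataFrame(data, columns=['y1', 'y2', 'y3', 'y4', 'y5'])
df.insert(0, 'x', np.arange(len(df), dtype='int64'))

a, b = gather_numpy(df), gather_stack(df)
pd.testing.assert_frame_equal(a, b)   # both approaches agree

print('numpy:', timeit.timeit(lambda: gather_numpy(df), number=20))
print('stack:', timeit.timeit(lambda: gather_stack(df), number=20))
```

Note that `np.isnan` only works on numeric arrays, so this sketch (like the functions above) assumes the y columns hold numbers.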
