
Pandas DataFrame aggregated column with names of other columns as value

I'm trying to create a new column in my DataFrame that holds a list of aggregated column names. Here's a sample DataFrame:

In [1]: df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9],
   ...:                    'D': [1, 3, 5], 'E': [5, 3, 6], 'F': [7, 4, 3]})
In [2]: df
   A  B  C  D  E  F
0  1  4  7  1  5  7
1  2  5  8  3  3  4
2  3  6  9  5  6  3

I'd like to create a new column containing a list of the column names where a certain condition is met. Say that I'm interested in columns where value > 3. I would want an output that looks like this:

In [3]: df
   A  B  C  D  E  F  Flag
0  1  4  7  1  5  7  ['B', 'C', 'E', 'F']
1  2  5  8  3  3  4  ['B', 'C', 'F']
2  3  6  9  5  6  3  ['B', 'C', 'D', 'E']

Currently, I'm using apply:

df['Flag'] = df.apply(lambda row: [list(df)[i] for i, j in enumerate(row) if j > 3], axis = 1)

This gets the job done, but it feels clunky, and I'm wondering if there is a more elegant solution.

Thanks!

Use df.dot() here:


   A  B  C  D  E  F          Flag
0  1  4  7  1  5  7  [B, C, E, F]
1  2  5  8  3  3  4     [B, C, F]
2  3  6  9  5  6  3  [B, C, D, E]

I still like a for loop here:

df['Flag']=[df.columns[x].tolist() for x in df.gt(3).values]
   A  B  C  D  E  F          Flag
0  1  4  7  1  5  7  [B, C, E, F]
1  2  5  8  3  3  4     [B, C, F]
2  3  6  9  5  6  3  [B, C, D, E]
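
To see why the comprehension works, here is a minimal sketch that unpacks it on the sample data from the question. Each row of df.gt(3).values is a boolean array, and indexing df.columns with it keeps only the names where the condition holds. The names mask and flags are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9],
                   'D': [1, 3, 5], 'E': [5, 3, 6], 'F': [7, 4, 3]})

# 2-D boolean NumPy array: True where the value is greater than 3
mask = df.gt(3).values

# Boolean-index the column Index once per row, then convert to a plain list
flags = [df.columns[row].tolist() for row in mask]

print(flags[0])  # ['B', 'C', 'E', 'F']
```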

One option is to create a DataFrame of booleans by checking which values are above a certain threshold with DataFrame.gt, and take the dot product with the column names. Finally, use apply(list) to obtain lists from the resulting strings:

df['Flag'] = df.gt(3).dot(df.columns).apply(list)

   A  B  C  D  E  F          Flag
0  1  4  7  1  5  7  [B, C, E, F]
1  2  5  8  3  3  4     [B, C, F]
2  3  6  9  5  6  3  [B, C, D, E]
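
The dot product works here because multiplying a string by True keeps it and multiplying by False gives an empty string, so each row sums to the concatenation of the matching column names. Note that apply(list) then splits that string into characters, which is only correct when every column name is a single character. The sketch below shows both points on the question's sample data. The col_ prefix and the names joined, wide, and sep_joined are assumptions made up for this illustration, and the separator variant is one possible workaround for longer names, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9],
                   'D': [1, 3, 5], 'E': [5, 3, 6], 'F': [7, 4, 3]})

# Intermediate result: one concatenated string per row
joined = df.gt(3).dot(df.columns)
print(joined.tolist())  # ['BCEF', 'BCF', 'BCDE']

# Splitting into characters is fine for single-character names
flags = joined.apply(list)

# Hypothetical multi-character names: append a separator, then split on it
wide = df.add_prefix('col_')
sep_joined = wide.gt(3).dot(wide.columns + ',').str.rstrip(',')
flags_wide = sep_joined.str.split(',')
# Caveat: a row with no match would yield [''] rather than an empty list
print(flags_wide[1])  # ['col_B', 'col_C', 'col_F']
```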


df['Flag'] = df.T.apply(lambda x: list(x[x>3].index))

Edit: adding timings for all solutions to this question.

I prefer a solution without apply:

df['Flag'] = df.reset_index().melt(id_vars='index', value_name='val', var_name='col').query('val > 3').groupby('index')['col'].agg(list)

Or:

df['Flag'] = df.stack().rename('val').reset_index(level=1).query('val > 3').groupby(level=0)['level_1'].agg(list)

   A  B  C  D  E  F          Flag
0  1  4  7  1  5  7  [B, C, E, F]
1  2  5  8  3  3  4     [B, C, F]
2  3  6  9  5  6  3  [B, C, D, E]
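
The melt-based chain can be easier to follow when split into steps. The sketch below uses the question's sample data plus one extra row of zeros, which is an assumption added here to show an edge case: a row with no value above the threshold has no group, so after index alignment it ends up as NaN rather than an empty list. The intermediate names long_form, kept, and flags are illustrative.

```python
import pandas as pd

# Sample data from the question, plus a hypothetical all-zero row (index 3)
df = pd.DataFrame({'A': [1, 2, 3, 0], 'B': [4, 5, 6, 0], 'C': [7, 8, 9, 0],
                   'D': [1, 3, 5, 0], 'E': [5, 3, 6, 0], 'F': [7, 4, 3, 0]})

# One row per (original index, column name, value)
long_form = df.reset_index().melt(id_vars='index', var_name='col', value_name='val')

# Keep only the cells that satisfy the condition
kept = long_form.query('val > 3')

# Collect the surviving column names for each original row
flags = kept.groupby('index')['col'].agg(list)

# Assignment aligns on the index; row 3 has no group, so it becomes NaN
df['Flag'] = flags
print(df)
```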

Test data:

a = [
    [1,  4,  7,  1,  5,  7],
    [2,  5,  8,  3,  3,  4],
    [3,  6,  9,  5,  6,  3],
    ] * 10000

df = pd.DataFrame(a, columns = list('ABCDEF'))  

Timing with %timeit:

In [79]: %timeit (df>3).dot(df.columns).apply(list)
40.8 ms ± 1.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [80]: %timeit [df.columns[x].tolist() for x in df.gt(3).values]
1.23 s ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [81]: %timeit df.gt(3).dot(df.columns).apply(list)
37.6 ms ± 644 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [82]: %timeit df.T.apply(lambda x: list(x[x>3].index))
16.4 s ± 99.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [83]: %timeit df.stack().rename('val').reset_index(level=1).query('val > 3')
    ...: .groupby(level=0)['level_1'].agg(list)
4.05 s ± 15.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [84]: %timeit df.apply(lambda x: df.columns[np.argwhere(x>3).ravel()].values
    ...: , 1)
c:\program files\python37\lib\site-packages\numpy\core\fromnumeric.py:56: FutureWarning: Series.nonzero() is deprecated and will be removed in a future version.
Use Series.to_numpy().nonzero() instead
  return getattr(obj, method)(*args, **kwds)
12 s ± 45.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The fastest solutions are the ones using .dot.
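
%timeit is only available in IPython. As a rough sketch of how a comparison like this could be reproduced in a plain Python script, the standard-library timeit module can be used instead. The candidates dictionary and the reduced data size (1000 repeats instead of 10000) are assumptions chosen to keep the run short, and only two of the solutions are included. Absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

rows = [[1, 4, 7, 1, 5, 7],
        [2, 5, 8, 3, 3, 4],
        [3, 6, 9, 5, 6, 3]] * 1000  # smaller than the test data above
df = pd.DataFrame(rows, columns=list('ABCDEF'))

# Hypothetical registry of the approaches to compare
candidates = {
    'dot': lambda: df.gt(3).dot(df.columns).apply(list),
    'loop': lambda: [df.columns[x].tolist() for x in df.gt(3).values],
}

# Check that both approaches agree before comparing their speed
assert candidates['dot']().tolist() == candidates['loop']()

timings = {}
for name, func in candidates.items():
    # Best of 3 repeats with 3 calls each, reported as seconds per call
    timings[name] = min(timeit.repeat(func, number=3, repeat=3)) / 3
    print(f'{name}: {timings[name] * 1000:.1f} ms per call')
```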

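Using numpy.argwhere and ravel(). Converting the mask with to_numpy() first avoids the Series.nonzero() deprecation warning:

df.apply(lambda x: df.columns[np.argwhere((x > 3).to_numpy()).ravel()].values, axis=1)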


df['Flag'] = ((df >3) @ df.columns).map(list)
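
The @ operator is Python's matrix-multiplication operator, which pandas maps to DataFrame.dot, so this line is a shorter spelling of the dot-based answers and shares their reliance on single-character column names. A minimal check on the question's sample data, with via_operator and via_method as illustrative names:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9],
                   'D': [1, 3, 5], 'E': [5, 3, 6], 'F': [7, 4, 3]})

# Operator form: comparison with >, matrix product with @, then map
via_operator = ((df > 3) @ df.columns).map(list)

# Method form: the same steps spelled with gt, dot, and apply
via_method = df.gt(3).dot(df.columns).apply(list)

print(via_operator.tolist() == via_method.tolist())  # True
```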
