pandas DataFrame: list the columns that have certain values for each row

Question

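I have a dataset with more than 400 columns. The first column is a company identifier, the second is an article identifier, and the remaining columns are attributes of the article. There are more than 50,000 companies, each with up to 1,000 articles. For most companies, the attribute values that matter to me are the same across all of their articles, but not for every company. I am analyzing the data with a pandas DataFrame. I want to add a column that lists, for each company, all the columns whose values differ.

Example (using integers for articles and companies for readability):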

import pandas as pd
df = pd.DataFrame({'company':[1,1,2,2,3,3], 'article':[1,2,1,2,1,2], 'col1':[1,1,2,2,3,3], 'col2':[1,2,3,3,4,4], 'col3':[1,2,3,3,4,5] })
diff = df.groupby('company').nunique()
diff['diff_columns'] = ???
diff[['company', 'diff_columns']]

The result should look like this:

company   diff_columns
1         ['col2', 'col3']
2         []
3         ['col3']

How can I achieve this?

Answer 1

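You can count the distinct values in each column within each company group, then use itertools.compress() to filter the list of column names with a list of booleans.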

import itertools

columns_to_diff = ['col1', 'col2', 'col3']

# For each company, keep the columns that hold more than one distinct value.
diff = df.groupby('company').apply(
    lambda group: list(itertools.compress(
        columns_to_diff,
        [(len(group[col].value_counts()) != 1) for col in columns_to_diff],
    ))
)

print(diff.to_frame('diff_columns'))

         diff_columns
company              
1        [col2, col3]
2                  []
3              [col3]
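
With more than 50,000 companies and 400 columns, calling a Python lambda per group may be slow. The following is a minimal sketch of a variant that is not part of the original answer: it lets `groupby(...).nunique()` compute the distinct-value counts in one pass and only uses `itertools.compress()` to turn each row of booleans into column names. The names `counts`, `mask`, and `diff_columns` are illustrative, and the sketch assumes the columns to compare are listed in `columns_to_diff` as above.

```python
import itertools

import pandas as pd

df = pd.DataFrame({
    'company': [1, 1, 2, 2, 3, 3],
    'article': [1, 2, 1, 2, 1, 2],
    'col1': [1, 1, 2, 2, 3, 3],
    'col2': [1, 2, 3, 3, 4, 4],
    'col3': [1, 2, 3, 3, 4, 5],
})

columns_to_diff = ['col1', 'col2', 'col3']

# Distinct values per company and column (NaN is ignored by default).
counts = df.groupby('company')[columns_to_diff].nunique()

# Boolean matrix: True where a column takes more than one value in a company.
mask = (counts > 1).to_numpy()

# One list of column names per company, selected by that company's boolean row.
diff_columns = pd.Series(
    [list(itertools.compress(columns_to_diff, row)) for row in mask],
    index=counts.index,
    name='diff_columns',
)

print(diff_columns.to_frame())
```

Like `value_counts()`, `nunique()` skips missing values by default, so a column that mixes one value with NaN would not be reported as differing in either version.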

Solution 1
1 Accepted 2021-04-26 09:32:38
