pandas dataframe list columns having some value for each row

Question

I have a dataset with 400+ columns: the first column is a company identifier, the second is an article identifier, and the rest are attributes of the article. There are more than 50,000 companies and up to 1,000 articles per company. For most companies, the attribute values that matter to me are identical across all articles, but not for all of them. I am using pandas DataFrames to analyse the data. I'd like to add a column that lists, for each company, every column whose values differ between its articles.

Example (using ints for article and company for easier reading):

import pandas as pd
df = pd.DataFrame({'company':[1,1,2,2,3,3], 'article':[1,2,1,2,1,2], 'col1':[1,1,2,2,3,3], 'col2':[1,2,3,3,4,4], 'col3':[1,2,3,3,4,5] })
diff = df.groupby('company').nunique()
diff['diff_columns'] = ???
diff[['company', 'diff_columns']]

The result should look like this:

company   diff_columns
1         ['col2', 'col3']
2         []
3         ['col3']

How can I achieve that?

Answer 1

You can count the distinct values of each column within each company group, then use itertools.compress() to filter the list of column names by the resulting list of booleans.

import itertools

columns_to_diff = ['col1', 'col2', 'col3']

# Per company, keep the columns that hold more than one distinct value.
diff = df.groupby('company').apply(
    lambda group: list(itertools.compress(
        columns_to_diff,
        [len(group[col].value_counts()) != 1 for col in columns_to_diff])))

print(diff.to_frame('diff_columns'))

         diff_columns
company              
1        [col2, col3]
2                  []
3              [col3]
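
Since the question already starts from groupby('company').nunique(), the same result can probably be reached without itertools. The following is only a sketch under that assumption, not part of the original answer: it reuses the example data from the question, and the names counts, differs and result are made up for illustration.

```python
import pandas as pd

# Example data from the question.
df = pd.DataFrame({'company': [1, 1, 2, 2, 3, 3],
                   'article': [1, 2, 1, 2, 1, 2],
                   'col1': [1, 1, 2, 2, 3, 3],
                   'col2': [1, 2, 3, 3, 4, 4],
                   'col3': [1, 2, 3, 3, 4, 5]})

columns_to_diff = ['col1', 'col2', 'col3']

# Number of distinct values per company and column (NaN is not counted).
counts = df.groupby('company')[columns_to_diff].nunique()

# True where a company has more than one distinct value in a column.
differs = counts > 1

# For each company row, collect the names of the columns flagged True.
diff_columns = pd.Series(
    [list(differs.columns[mask]) for mask in differs.to_numpy()],
    index=differs.index,
    name='diff_columns',
)

# Turn the 'company' index back into a regular column.
result = diff_columns.reset_index()
print(result)
```

One difference worth checking on real data: a column that is entirely NaN for a company has zero distinct values, so the value_counts() != 1 test above flags it as differing, while the > 1 test in this sketch does not.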
