I have a dataset with 400+ columns: the first column is a company identifier, the second is an article identifier, and the rest are attributes of the article. There are more than 50,000 companies and up to 1,000 articles per company. For most companies, the attribute values that matter to me are identical across all articles, but not for every company. I am analysing the data with pandas DataFrames. I'd like to add a column that lists, for each company, every column whose values differ between its articles.
Example (using ints for article and company for easier reading):
import pandas as pd
df = pd.DataFrame({'company':[1,1,2,2,3,3], 'article':[1,2,1,2,1,2], 'col1':[1,1,2,2,3,3], 'col2':[1,2,3,3,4,4], 'col3':[1,2,3,3,4,5] })
diff = df.groupby('company').nunique()
diff['diff_columns'] = ???
diff[['diff_columns']]  # 'company' is the index after the groupby, not a column
The result should look like this:
company diff_columns
1 ['col2', 'col3']
2 []
3 ['col3']
How can I achieve that?
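You can count the distinct values in each column within each company's group. A column differs when that count is not 1. Then use itertools.compress() to filter the list of column names by the resulting boolean list.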
import itertools
columns_to_diff = ['col1', 'col2', 'col3']
# For each company, keep the names of the columns that do not hold exactly one distinct value.
diff = df.groupby('company').apply(lambda group: list(itertools.compress(columns_to_diff, [(len(group[col].value_counts()) != 1) for col in columns_to_diff])))
print(diff.to_frame('diff_columns'))
diff_columns
company
1 [col2, col3]
2 []
3 [col3]
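With 50,000+ companies and 400+ columns, calling a Python lambda per group may be slow. A possible alternative, sketched here under the assumption that the columns to compare are listed in columns_to_diff as above, is to let groupby(...).nunique() count the distinct values for all columns at once and then turn the resulting boolean mask into lists of column names. The variable names counts, differs, names and result are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'company': [1, 1, 2, 2, 3, 3],
    'article': [1, 2, 1, 2, 1, 2],
    'col1': [1, 1, 2, 2, 3, 3],
    'col2': [1, 2, 3, 3, 4, 4],
    'col3': [1, 2, 3, 3, 4, 5],
})
columns_to_diff = ['col1', 'col2', 'col3']

# Distinct values per company and column (NaN is ignored by default).
counts = df.groupby('company')[columns_to_diff].nunique()

# True where a column holds more than one distinct value for that company.
differs = counts > 1

# Turn each boolean row into the list of column names flagged True.
names = differs.columns.to_numpy()
result = pd.DataFrame(
    {'diff_columns': [list(names[mask]) for mask in differs.to_numpy()]},
    index=differs.index,
)
print(result)
```

Note that nunique() skips NaN by default, so a column that mixes one value with missing values is not reported as differing. Passing dropna=False to nunique() would count NaN as a value of its own, if that is the behaviour you want.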