
Find the most common combination in a DataFrame

I am using pandas and I am trying to figure out how to get the most common combinations of products that people use in my data file.

Suppose that each of the three columns AA, BB and CC represents a completely different product, where a value of 0 means the buyer does not use the product and a value of 1 means they do. Each row represents a different buyer.

For example, the most common combination in my example is products AA and CC, because three people use them, as you can see in rows 1, 4 and 5.

I would like the result to be something like 'The most common combination is the products AA and CC which are used by 3 people'.

I hope I have explained it better this time.

Below is an example of my DataFrame:

AA  | BB  | CC
_______________
1   | 0   |  1
0   | 0   |  1
0   | 1   |  0
1   | 0   |  1
1   | 0   |  1

Once you count duplicate rows, you just need to do a bit of work to get the corresponding labels.

Here's how I would do it, though I'm not very familiar with pandas so there's probably a better way. First, the df should be boolean, so that each counted combination can be used as a mask to select the column labels.

import pandas as pd

df = pd.DataFrame({
    'AA': [1, 0, 0, 1, 1],
    'BB': [0, 0, 1, 0, 0],
    'CC': [1, 1, 0, 1, 1]}
    ).astype(bool)

# Count duplicate rows
counts = df.groupby(df.columns.tolist()).size()
# Get most common rows
maxima = counts[counts==counts.max()]
for combination, count in maxima.items():
    # Select matching labels
    labels = df.columns[list(combination)]
    print(*labels, count)

Output:

AA CC 3

Partial results:

>>> counts
AA     BB     CC   
False  False  True     1
       True   False    1
True   False  True     3
dtype: int64

>>> maxima
AA    BB     CC  
True  False  True    3
dtype: int64
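A more compact variant of the same idea, offered only as a sketch: it assumes pandas 1.1 or newer, where DataFrame.value_counts() exists and counts identical rows directly. The names top_count, top and results are mine, not part of the answer above, and the output sentence follows the wording requested in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'AA': [1, 0, 0, 1, 1],
    'BB': [0, 0, 1, 0, 0],
    'CC': [1, 1, 0, 1, 1],
}).astype(bool)

# Count identical rows; the result is sorted with the largest count first
counts = df.value_counts()
top_count = int(counts.iloc[0])
# Keep every combination that ties for first place
top = counts[counts == top_count]

results = []
for combination, count in top.items():
    # combination is a tuple of booleans in column order
    labels = [col for col, used in zip(df.columns, combination) if used]
    results.append((labels, int(count)))

for labels, count in results:
    print(f"The most common combination is the products {' and '.join(labels)} "
          f"which are used by {count} people")
```

Collecting the matches in a list first means ties are all reported, one sentence per combination.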

I had almost figured out the solution to my question before your response, but your answer, wjandrea, was partially correct, so thank you.

First, I had to go through the whole DataFrame row by row, find the entries equal to 1, and collect the names of the corresponding products, like this:

combination = df.apply(lambda row: row[row == 1].index.tolist(), axis=1)
combination = pd.DataFrame(combination)

After that, I created a new column with the names of the products each user uses, joined into a single string with a separator, like this:

df['Products'] = [' , '.join(map(str, l)) for l in combination[0]]

Then I just used your code and got exactly what I wanted.
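Put together, the steps above might look like the following end-to-end sketch. It is a guess at how the pieces fit, not the exact code that was run: product_cols, product_counts, best and best_count are names I made up, and counting with Series.value_counts() is swapped in here in place of the groupby from the other answer.

```python
import pandas as pd

df = pd.DataFrame({
    'AA': [1, 0, 0, 1, 1],
    'BB': [0, 0, 1, 0, 0],
    'CC': [1, 1, 0, 1, 1],
})

# Hypothetical list of the product columns, kept separate from the
# 'Products' column that is added below
product_cols = ['AA', 'BB', 'CC']

# For each buyer, join the names of the products whose value is 1
df['Products'] = df[product_cols].apply(
    lambda row: ' , '.join(row[row == 1].index), axis=1)

# Count how often each joined combination occurs
product_counts = df['Products'].value_counts()
best = product_counts.idxmax()
best_count = int(product_counts.max())

print(f"The most common combination is the products {best} "
      f"which are used by {best_count} people")
```

Note that idxmax() returns only one label, so if two combinations tie for first place this sketch reports just one of them.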
