
Find the most common combination in a DataFrame

I am using pandas and I am trying to figure out a way to get the most common combinations of products that people use in my data file.

Suppose each of the three columns AA, BB and CC represents a completely different product, where a value of 0 means the buyer does not use the product and 1 means they do. Each row represents a different buyer.

For example, the most common combination in my example is the products AA and CC, because three people use them, as you can see in rows 1, 4 and 5.

I would like my result to be something like 'The most common combination is the products AA and CC which are used by 3 people'.

I hope I have explained it better this time.

Below is an example of my DataFrame:

AA  | BB  | CC
_______________
1   | 0   |  1
0   | 0   |  1
0   | 1   |  0
1   | 0   |  1
1   | 0   |  1

Once you count duplicate rows, you just need to do a bit of work to get the corresponding labels.

Here's how I would do it, though I'm not very familiar with pandas, so there's probably a better way. Firstly, the df should be boolean, so that each row combination can be used as a mask to select the matching column labels.

import pandas as pd

df = pd.DataFrame({
    'AA': [1, 0, 0, 1, 1],
    'BB': [0, 0, 1, 0, 0],
    'CC': [1, 1, 0, 1, 1]}
    ).astype(bool)

# Count duplicate rows
counts = df.groupby(df.columns.tolist()).size()
# Get most common rows
maxima = counts[counts==counts.max()]
for combination, count in maxima.items():  # Series.iteritems() was removed in pandas 2.0
    # Select matching labels
    labels = df.columns[list(combination)]
    print(*labels, count)

Output:

AA CC 3

Partial results:

>>> counts
AA     BB     CC   
False  False  True     1
       True   False    1
True   False  True     3
dtype: int64

>>> maxima
AA    BB     CC  
True  False  True    3
dtype: int64
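
As a possible shortcut, the duplicate-row count above can also be sketched with DataFrame.value_counts, which is available from pandas 1.1 onward. This is only an illustration on the sample data from the question, not a replacement for the groupby approach; the names top_count and messages are made up for this example, and the sentence format follows the one requested in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'AA': [1, 0, 0, 1, 1],
    'BB': [0, 0, 1, 0, 0],
    'CC': [1, 1, 0, 1, 1],
}).astype(bool)

# Count identical rows; the result is sorted with the most frequent row first
counts = df.value_counts()
top_count = counts.iloc[0]

messages = []  # hypothetical helper list, kept so ties are all reported
for combination, count in counts[counts == top_count].items():
    # Each combination is a tuple of booleans, used as a mask on the column labels
    labels = df.columns[list(combination)]
    messages.append(
        f"The most common combination is the products {' and '.join(labels)} "
        f"which are used by {count} people"
    )

for message in messages:
    print(message)
```

Filtering on counts == top_count keeps every combination tied for first place, which matches the behaviour of the maxima selection in the code above.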

I had almost figured out the solution to my question before your response, but you, wjandrea, were partially correct, so thank you.

First, I had to go through the whole DataFrame, row by row, looking for the values equal to 1 and collecting the names of the products that have a 1.

combination = df.apply(lambda row: row[row == 1].index.tolist(), axis=1)
combination = pd.DataFrame(combination)

After that, I created a new column with the names of the products each user uses, joined into a single string with a separator.

df['Products'] = [' , '.join(map(str, l)) for l in combination[0]]

Then I just used your code and got exactly what I wanted.
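
The post does not show how the two pieces were combined, so the following is only a guess at one way the steps could fit together: build the Products column from the 0/1 columns, then count its values. The names product_columns, product_counts and best are assumptions introduced for this sketch, and counting with Series.value_counts is a substitute for the groupby code from the other answer.

```python
import pandas as pd

df = pd.DataFrame({
    'AA': [1, 0, 0, 1, 1],
    'BB': [0, 0, 1, 0, 0],
    'CC': [1, 1, 0, 1, 1],
})

# Assumed list of the product columns (hypothetical name)
product_columns = ['AA', 'BB', 'CC']

# For each row, join the names of the columns that hold a 1
df['Products'] = df[product_columns].apply(
    lambda row: ' , '.join(row[row == 1].index), axis=1)

# Count how often each joined combination appears
product_counts = df['Products'].value_counts()
best = product_counts.idxmax()
print(best, product_counts.max())
```

Note that idxmax returns only one label, so if several combinations were tied for first place, only one of them would be reported.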
