
Grouping entries in pandas dataframe

Hello, I have a dataframe like this:

import pandas as pd

df = pd.DataFrame({'Item': ['A','A','A','B','B','C','C','C','C'],
                   'Name': ['Tom','John','Paul','Tom','Frank','Tom','John','Richard','James']})
df 
Item Name
A    Tom
A    John
A    Paul
B    Tom 
B    Frank
C    Tom
C    John
C    Richard
C    James

For each person, I want the list of the other people who share an item with them, and the number of times each of them does:

df1 
Name              People                          Times
Tom     [John, Paul, Frank, Richard, James]       [2,1,1,1,1]
John    [Tom, Paul, Richard, James]               [2,1,1,1]
Paul    [Tom, John]                               [1,1]
Frank   [Tom]                                     [1]
Richard [Tom, John, James]                        [1,1,1]
James   [Tom, John, Richard]                      [1,1,1]

So far I have tried this to count the distinct people for each item:

df.groupby("Item").agg({ "Name": pd.Series.nunique})
      Name
Item    
A      3
B      2
C      4

and

df.groupby("Name").agg({ "Item": pd.Series.nunique})
        Item
Name    
Frank   1
James   1
John    2
Paul    1
Richard 1
Tom     3

You can use numpy.unique to count the items in the lists:

print(df)
  Item     Name
0    A      Tom
1    A     John
2    A     Paul
3    B      Tom
4    B    Frank
5    C      Tom
6    C     John
7    C  Richard
8    C    James

#merge M:N by column Item
df1 = pd.merge(df, df, on=['Item'])

#remove self-pairs, i.e. rows where Name_x == Name_y
df1 = df1[df1['Name_x'] != df1['Name_y']]
#print(df1)

#create lists
df1 = df1.groupby('Name_x')['Name_y'].apply(lambda x: x.tolist()).reset_index()
print(df1)
    Name_x                                     Name_y
0    Frank                                      [Tom]
1    James                       [Tom, John, Richard]
2     John           [Tom, Paul, Tom, Richard, James]
3     Paul                                [Tom, John]
4  Richard                         [Tom, John, James]
5      Tom  [John, Paul, Frank, John, Richard, James]
#get unique names and their counts with np.unique (requires import numpy as np)
df1['People'] = df1['Name_y'].apply(lambda a: np.unique(a, return_counts=True)[0])
df1['times'] = df1['Name_y'].apply(lambda a: np.unique(a, return_counts=True)[1])
#remove column Name_y
df1 = df1.drop('Name_y', axis=1).rename(columns={'Name_x':'Name'})
print(df1)
      Name                               People            times
0    Frank                                [Tom]              [1]
1    James                 [John, Richard, Tom]        [1, 1, 1]
2     John          [James, Paul, Richard, Tom]     [1, 1, 1, 2]
3     Paul                          [John, Tom]           [1, 1]
4  Richard                   [James, John, Tom]        [1, 1, 1]
5      Tom  [Frank, James, John, Paul, Richard]  [1, 1, 2, 1, 1]
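
The steps above can be collected into one self-contained Python 3 sketch. The helper name `co_occurrences` is hypothetical (it is not part of the original answer), and the sketch calls `np.unique` once per name instead of twice; otherwise it follows the same merge, filter, and group sequence.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
    'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank',
             'Tom', 'John', 'Richard', 'James'],
})


def co_occurrences(frame):
    """Hypothetical helper: for each name, list the other names that
    share an item with it and how many items they share."""
    # Self-merge on Item pairs every name with every name on the same item.
    pairs = frame.merge(frame, on='Item')
    # Drop the rows that pair a name with itself.
    pairs = pairs[pairs['Name_x'] != pairs['Name_y']]
    rows = []
    for name, others in pairs.groupby('Name_x')['Name_y']:
        # np.unique returns the sorted distinct names and their counts.
        people, times = np.unique(others.to_numpy(), return_counts=True)
        rows.append({'Name': name,
                     'People': people.tolist(),
                     'times': times.tolist()})
    return pd.DataFrame(rows)


result = co_occurrences(df)
print(result)
```

Because `tolist()` is applied, the cells hold plain Python lists rather than NumPy arrays, which tends to be easier to compare or serialize later.
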
Alternatively, build a crosstab of names against items:

ct = pd.crosstab(df.Name, df.Item)

d = {Name: [(name, val) 
            for name, val in ct.loc[ct.index != Name, ct.loc[Name] == 1]
            .sum(axis=1).items() if val] 
     for Name in df.Name.unique()}

>>> pd.DataFrame({'Name': d.keys(), 
                  'People': [[t[0] for t in d[name]] for name in d], 
                  'times': [[t[1] for t in d[name]] for name in d]})
      Name                               People            times
0  Richard                   [James, John, Tom]        [1, 1, 1]
1    James                 [John, Richard, Tom]        [1, 1, 1]
2      Tom  [Frank, James, John, Paul, Richard]  [1, 1, 2, 1, 1]
3    Frank                                [Tom]              [1]
4     Paul                          [John, Tom]           [1, 1]
5     John          [James, Paul, Richard, Tom]     [1, 1, 1, 2]

EXPLANATION

The crosstab shows, for each name, which items that name appears under.

>>> ct
Item     A  B  C
Name            
Frank    0  1  0
James    0  0  1
John     1  0  1
Paul     1  0  0
Richard  0  0  1
Tom      1  1  1

This table is then shrunk. For each name whose dictionary entry is being built, that name's row is removed from the table and only the columns (items) where that name appears are kept.

Using 'John' as an example:

>>> ct.loc[ct.index != 'John', ct.loc['John'] == 1]
Item     A  C
Name         
Frank    0  0
James    0  1
Paul     1  0
Richard  0  1
Tom      1  1

Each row of this result is then summed to yield the counts for John:

Name
Frank      0
James      1
Paul       1
Richard    1
Tom        2
dtype: int64
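
As a cross-check on the row sums above, the same counts can be read off a matrix product of the crosstab with its transpose. This is a different technique from the one in the answer, shown only as a sketch; it assumes each (Name, Item) pair occurs at most once in `df`, so that `ct` contains only 0s and 1s.

```python
import pandas as pd

df = pd.DataFrame({
    'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
    'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank',
             'Tom', 'John', 'Richard', 'James'],
})

ct = pd.crosstab(df['Name'], df['Item'])

# Entry (i, j) of ct . ct^T is the number of items shared by names i and j
# (valid under the assumption that ct holds only 0s and 1s).
shared = ct.dot(ct.T)

# John's row without the diagonal entry (John paired with himself).
john = shared.loc['John'].drop('John')
print(john)
```

Filtering with `john[john > 0]` would then drop Frank, who shares no item with John.
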

These results are then iterated over to pack them into tuple pairs and to drop the cases where the value is zero (e.g. Frank above).

>>> [(name, val) for name, val in 
     ct.loc[ct.index != 'John', ct.loc['John'] == 1].sum(axis=1).items() if val]
[('James', 1), ('Paul', 1), ('Richard', 1), ('Tom', 2)]

This action is performed for each name using a dictionary comprehension.

>>> d
{'Frank': [('Tom', 1)],
 'James': [('John', 1), ('Richard', 1), ('Tom', 1)],
 'John': [('James', 1), ('Paul', 1), ('Richard', 1), ('Tom', 2)],
 'Paul': [('John', 1), ('Tom', 1)],
 'Richard': [('James', 1), ('John', 1), ('Tom', 1)],
 'Tom': [('Frank', 1), ('James', 1), ('John', 2), ('Paul', 1), ('Richard', 1)]}

This dictionary is then used to create the desired dataframe, using a nested list comprehension to unpack the tuple pairs.
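
Putting the explanation together, here is a minimal end-to-end sketch of the crosstab approach for Python 3 and recent pandas. It assumes `.loc` and `.items()` are used in place of the older `.ix` and `.iteritems()`, and it writes the dictionary comprehension as an explicit loop for readability; the variable names `counts`, `other`, and `pairs` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
    'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank',
             'Tom', 'John', 'Richard', 'James'],
})

ct = pd.crosstab(df['Name'], df['Item'])

d = {}
for name in df['Name'].unique():
    # Keep every row except this name's own, and only the item columns
    # in which this name appears, then sum each remaining row.
    counts = ct.loc[ct.index != name, ct.loc[name] == 1].sum(axis=1)
    # Pack non-zero counts into (other_name, count) pairs.
    d[name] = [(other, int(val)) for other, val in counts.items() if val]

# Unpack the pairs into two parallel list columns.
result = pd.DataFrame({
    'Name': list(d),
    'People': [[other for other, _ in pairs] for pairs in d.values()],
    'times': [[val for _, val in pairs] for pairs in d.values()],
})
print(result)
```

Since dictionaries preserve insertion order in Python 3.7+, the rows come out in order of first appearance in `df` rather than in the order shown in the output above.
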
