Grouping entries in pandas dataframe
Hello, I have a dataframe like this:
import pandas as pd
df = pd.DataFrame({'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
                   'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank', 'Tom', 'John', 'Richard', 'James']})
df
Item Name
A Tom
A John
A Paul
B Tom
B Frank
C Tom
C John
C Richard
C James
For each person, I want the list of people who share an item with them, and the number of times each of those people does:
df1
Name People Times
Tom [John, Paul, Frank, Richard, James] [2,1,1,1,1]
John [Tom, Paul, Richard, James] [2,1,1,1]
Paul [Tom, John] [1,1]
Frank [Tom] [1]
Richard [Tom, John, James] [1,1,1]
James [Tom, John, Richard] [1,1,1]
So far I have tried this to count the distinct people for each item:
df.groupby("Item").agg({ "Name": pd.Series.nunique})
Name
Item
A 3
B 2
C 4
and
df.groupby("Name").agg({ "Item": pd.Series.nunique})
Item
Name
Frank 1
James 1
John 2
Paul 1
Richard 1
Tom 3
You can use numpy.unique to count the items in the lists:
print(df)
Item Name
0 A Tom
1 A John
2 A Paul
3 B Tom
4 B Frank
5 C Tom
6 C John
7 C Richard
8 C James
#merge M:N by column Item
df1 = pd.merge(df, df, on=['Item'])
#remove self-pairs - rows where Name_x == Name_y
df1 = df1[~(df1['Name_x'] == df1['Name_y'])]
#print df1
#create lists
df1 = df1.groupby('Name_x')['Name_y'].apply(lambda x: x.tolist()).reset_index()
print(df1)
Name_x Name_y
0 Frank [Tom]
1 James [Tom, John, Richard]
2 John [Tom, Paul, Tom, Richard, James]
3 Paul [Tom, John]
4 Richard [Tom, John, James]
5 Tom [John, Paul, Frank, John, Richard, James]
#get count by np.unique
df1['People'] = df1['Name_y'].apply(lambda a: np.unique((a), return_counts =True)[0])
df1['times'] = df1['Name_y'].apply(lambda a: np.unique((a), return_counts =True)[1])
#remove column Name_y
df1 = df1.drop('Name_y', axis=1).rename(columns={'Name_x':'Name'})
print(df1)
Name People times
0 Frank [Tom] [1]
1 James [John, Richard, Tom] [1, 1, 1]
2 John [James, Paul, Richard, Tom] [1, 1, 1, 2]
3 Paul [John, Tom] [1, 1]
4 Richard [James, John, Tom] [1, 1, 1]
5 Tom [Frank, James, John, Paul, Richard] [1, 1, 2, 1, 1]
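The steps above can also be folded into one function that counts the pairs with groupby(...).size() instead of numpy.unique. This is only a sketch under the same assumptions as the answer; the helper name shared_item_counts and the intermediate column n are made up for illustration and appear nowhere in the original.

```python
import pandas as pd

df = pd.DataFrame({
    'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
    'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank', 'Tom', 'John', 'Richard', 'James'],
})


def shared_item_counts(frame):
    """Hypothetical helper: for each name, list who shares an Item and how often."""
    # Self-merge on Item pairs every name with every name under the same item.
    pairs = frame.merge(frame, on='Item', suffixes=('_x', '_y'))
    # Drop the rows that pair a name with itself.
    pairs = pairs[pairs['Name_x'] != pairs['Name_y']]
    # One row per (Name_x, Name_y) pair with the number of shared items.
    counts = (pairs.groupby(['Name_x', 'Name_y'])
                   .size()
                   .reset_index(name='n'))
    # Collapse to one row per name with parallel lists of people and counts.
    grouped = counts.groupby('Name_x')
    result = pd.DataFrame({
        'People': grouped['Name_y'].agg(list),
        'times': grouped['n'].agg(list),
    })
    return result.rename_axis('Name').reset_index()


result = shared_item_counts(df)
print(result)
```

Because both groupby calls sort their keys, the names and the entries of each list come out in alphabetical order, as in the output above.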
ct = pd.crosstab(df.Name, df.Item)
# .ix and .iteritems() were removed from recent pandas; use .loc and .items()
d = {Name: [(name, val)
            for name, val in ct.loc[ct.index != Name, ct.loc[Name] == 1]
            .sum(axis=1).items() if val]
     for Name in df.Name.unique()}
>>> pd.DataFrame({'Name': list(d),
'People': [[t[0] for t in d[name]] for name in d],
'times': [[t[1] for t in d[name]] for name in d]})
Name People times
0 Tom [Frank, James, John, Paul, Richard] [1, 1, 2, 1, 1]
1 John [James, Paul, Richard, Tom] [1, 1, 1, 2]
2 Paul [John, Tom] [1, 1]
3 Frank [Tom] [1]
4 Richard [James, John, Tom] [1, 1, 1]
5 James [John, Richard, Tom] [1, 1, 1]
EXPLANATION
The crosstab counts, for each name, how many times it appears under each Item.
>>> ct
Item A B C
Name
Frank 0 1 0
James 0 0 1
John 1 0 1
Paul 1 0 0
Richard 0 0 1
Tom 1 1 1
This table is then shrunk. For each name for which a key is being built, that name's row is removed from the table and only the columns (items) where that name appears are kept.
Using 'John' as an example:
>>> ct.loc[ct.index != 'John', ct.loc['John'] == 1]
Item A C
Name
Frank 0 0
James 0 1
Paul 1 0
Richard 0 1
Tom 1 1
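The selection above combines a boolean row mask with a boolean column mask in a single .loc call. A minimal sketch that names the two masks separately (row_mask, col_mask and sub are illustrative names, not from the original):

```python
import pandas as pd

df = pd.DataFrame({
    'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
    'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank', 'Tom', 'John', 'Richard', 'James'],
})
ct = pd.crosstab(df['Name'], df['Item'])

# Rows: every name except John (a boolean NumPy array over the index).
row_mask = ct.index != 'John'
# Columns: only the items John appears under (a boolean Series indexed by Item).
col_mask = ct.loc['John'] == 1

sub = ct.loc[row_mask, col_mask]
print(sub)
```

Note that the == 1 test assumes a name appears at most once per item; if duplicates are possible, > 0 would be the safer condition.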
This result is then summed along the rows to yield the results for John:
Name
Frank 0
James 1
Paul 1
Richard 1
Tom 2
dtype: int64
These results are then iterated over to pack them into tuple pairs and to drop the cases where the value is zero (e.g. Frank above).
>>> [(name, val) for name, val in
ct.loc[ct.index != 'John', ct.loc['John'] == 1].sum(axis=1).items() if val]
[('James', 1), ('Paul', 1), ('Richard', 1), ('Tom', 2)]
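The per-name step can be wrapped in a small function, which makes it easy to check one name at a time. This is a sketch only; co_members is a hypothetical helper name, and the int() conversion is added here just to return plain Python integers.

```python
import pandas as pd

df = pd.DataFrame({
    'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
    'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank', 'Tom', 'John', 'Richard', 'James'],
})
ct = pd.crosstab(df['Name'], df['Item'])


def co_members(table, name):
    """Hypothetical helper: (other_name, shared_item_count) pairs for one name."""
    others = table.index != name      # drop the name's own row
    items = table.loc[name] == 1      # keep only the items this name appears under
    totals = table.loc[others, items].sum(axis=1)
    # Skip names that share no item (count of zero).
    return [(other, int(count)) for other, count in totals.items() if count]


print(co_members(ct, 'John'))
# [('James', 1), ('Paul', 1), ('Richard', 1), ('Tom', 2)]
```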
This action is performed for each name using a dictionary comprehension.
>>> d
{'Frank': [('Tom', 1)],
'James': [('John', 1), ('Richard', 1), ('Tom', 1)],
'John': [('James', 1), ('Paul', 1), ('Richard', 1), ('Tom', 2)],
'Paul': [('John', 1), ('Tom', 1)],
'Richard': [('James', 1), ('John', 1), ('Tom', 1)],
'Tom': [('Frank', 1), ('James', 1), ('John', 2), ('Paul', 1), ('Richard', 1)]}
This dictionary is then used to create the desired dataframe, using nested list comprehensions to unpack the tuple pairs.
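As a possible alternative to looping over names, the same counts can be read off a name-by-name co-occurrence matrix obtained by multiplying the crosstab by its transpose. This is a different technique from the one described above, sketched under the assumption that each name appears at most once per item (so every crosstab cell is 0 or 1); the variable names co, values and rows are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
    'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank', 'Tom', 'John', 'Richard', 'James'],
})
ct = pd.crosstab(df['Name'], df['Item'])

# Entry (i, j) is the number of items that names i and j have in common.
co = ct.dot(ct.T)

# Zero the diagonal so that nobody is paired with themselves.
values = co.to_numpy().copy()
np.fill_diagonal(values, 0)
co = pd.DataFrame(values, index=co.index, columns=co.columns)

# Build one output row per name from the non-zero entries of its matrix row.
rows = []
for name, counts in co.iterrows():
    shared = counts[counts > 0]
    rows.append({'Name': name,
                 'People': shared.index.tolist(),
                 'times': shared.tolist()})
result = pd.DataFrame(rows)
print(result)
```

If a name could appear more than once under the same item, the products would no longer equal the number of shared items, so the crosstab would need to be clipped to 0/1 first.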