How to count the most frequent combinations in each group
I have a pandas DataFrame with post_ID and tag_ID in a long format (one post to many tags).
+---------+--------+
| post_ID | tag_ID |
+---------+--------+
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 2 | 1 |
| 2 | 4 |
| 2 | 6 |
| 3 | 1 |
| 4 | 5 |
| 4 | 6 |
| ... | ... |
+---------+--------+
My question is: when looking at tags grouped by post_ID, what are the most frequent two-tag combinations? As a result, I would like to have a frame that contains results like this:
+---------------------+-----+
| tag_ID_combinations | n |
+---------------------+-----+
| 1,2 | 50 |
| 3,4 | 200 |
| 5,6 | 20 |
+---------------------+-----+
Tags 1, 2 and 3 for post_ID 1 should count as 1,2, 1,3 and 2,3 if possible. But an aggregation like 1,2,3-1x; 1,4,6-1x; 1-1x and 5,6-1x would also be very helpful.
You could use DataFrame.groupby('col').agg(func) along with itertools.combinations to get all of the 2-tag combinations, and then use collections.Counter to get the number of occurrences for each combination.
from collections import Counter
from itertools import combinations
import pandas as pd
groups = df.groupby('post_ID').agg(lambda g: list(combinations(g, 2)))
combos = pd.DataFrame(
    Counter(groups.tag_ID.sum()).items(),
    columns=['tag_ID_combos', 'count']
)
The following example alters some of the data from your question so that at least a couple of tag combinations occur more than once.
from collections import Counter
from itertools import combinations
import pandas as pd
data = [(1,1),(1,2),(1,3),(2,1),(2,3),(2,6),(3,1),(4,3),(4,6)]
df = pd.DataFrame(data, columns=['post_ID', 'tag_ID'])
print(df)
# post_ID tag_ID
# 0 1 1
# 1 1 2
# 2 1 3
# 3 2 1
# 4 2 3
# 5 2 6
# 6 3 1
# 7 4 3
# 8 4 6
groups = df.groupby('post_ID').agg(lambda g: list(combinations(g, 2)))
combos = pd.DataFrame(Counter(groups.tag_ID.sum()).items(), columns=['tag_ID_combos', 'count'])
print(combos)
# tag_ID_combos count
# 0 (1, 2) 1
# 1 (1, 3) 2
# 2 (2, 3) 1
# 3 (1, 6) 1
# 4 (3, 6) 2
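A variant of the same idea, as a minimal sketch: it feeds each post's pairs straight into a Counter instead of concatenating lists with sum(), and it sorts the tags first so that (1, 3) and (3, 1) would be counted as the same pair. The sorting and de-duplication are assumptions about what "combination" should mean here, not something stated in the question.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

# Same sample data as in the example above.
data = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 3), (2, 6), (3, 1), (4, 3), (4, 6)]
df = pd.DataFrame(data, columns=["post_ID", "tag_ID"])

counts = Counter()
for _, tags in df.groupby("post_ID")["tag_ID"]:
    # Assumption: pairs are unordered, so sort and de-duplicate the tags
    # of each post before building the 2-tag combinations.
    counts.update(combinations(sorted(set(tags.tolist())), 2))

# most_common() returns (pair, count) tuples, highest count first.
combos = pd.DataFrame(counts.most_common(), columns=["tag_ID_combos", "count"])
print(combos)
```

Posts with a single tag (post_ID 3 here) simply contribute no pairs.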
Here is a solution if you just want to aggregate the occurrence count by post_ID. For your example (post_ID == 1), this solution would count the whole tag list once:
[1, 2, 3]: 1
and not all possible combinations:
[1, 2] = 1, [1, 3] = 1, [2, 3] = 1
df = df.groupby('post_ID')['tag_ID'].apply(list)
df = pd.DataFrame(df).reset_index()
# only if you want to throw out single occurrences
df = df[df['tag_ID'].map(len) > 1]
# cast the sorted lists to string
df['tag_ID_AS_STRING'] = [str(sorted(x)) for x in df['tag_ID']]
result = df['tag_ID_AS_STRING'].value_counts()
You can use groupby, for example:
df.groupby(['post_ID', 'tag_ID']).count()
This will generate a table with the (post_ID, tag_ID) combination as the index.
Another way is to create a combination column:
df['combo'] = df[['post_ID', 'tag_ID']].agg(tuple, axis=1)
Then do the groupby on the combo field.
Both of the above require more work, which I am sure you can do from here.
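One possible way to do that remaining work with groupby alone is a self-join, sketched below. This is a different technique from the two snippets above, not a completion of them: the frame is merged with itself on post_ID to pair up tags of the same post, and the pairs are then grouped and counted. The names pairs, pair_counts and the _a/_b suffixes are made up for illustration, and the sketch assumes each (post_ID, tag_ID) row appears only once.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({"post_ID": [1, 1, 1, 2, 2, 2, 3, 4, 4],
                   "tag_ID": [1, 2, 3, 1, 4, 6, 1, 5, 6]})

# Pair every tag with every tag of the same post.
pairs = df.merge(df, on="post_ID", suffixes=("_a", "_b"))
# Keep each unordered pair once and drop self-pairs such as (1, 1).
pairs = pairs[pairs["tag_ID_a"] < pairs["tag_ID_b"]]

# Count how many posts contain each pair.
pair_counts = (
    pairs.groupby(["tag_ID_a", "tag_ID_b"])
    .size()
    .reset_index(name="n")
    .sort_values("n", ascending=False)
)
print(pair_counts)
```

The merge creates one row per ordered tag pair within a post, so memory use grows with the square of the number of tags per post.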
The second kind of aggregation you mention is pretty straightforward to obtain:
df = pd.DataFrame({'post_ID': [1, 1, 1, 2, 2, 2, 3, 4, 4],
'tag_ID': [1, 2, 3, 1, 4, 6, 1, 5, 6]})
df.groupby('post_ID').tag_ID.unique().value_counts()
# [1] 1
# [1, 4, 6] 1
# [1, 2, 3] 1
# [5, 6] 1
# Name: tag_ID, dtype: int64
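Depending on the pandas version, value_counts() may raise a TypeError on the arrays returned by unique(), since NumPy arrays are not hashable. A minimal sketch that sidesteps this, assuming the order of tags within a post does not matter, turns each post's tags into a sorted tuple and counts those with collections.Counter:

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({"post_ID": [1, 1, 1, 2, 2, 2, 3, 4, 4],
                   "tag_ID": [1, 2, 3, 1, 4, 6, 1, 5, 6]})

# One sorted, de-duplicated tuple of tags per post; tuples are hashable.
tag_sets = [
    tuple(sorted(set(tags.tolist())))
    for _, tags in df.groupby("post_ID")["tag_ID"]
]

# Maps each whole tag set to the number of posts that have exactly that set.
set_counts = Counter(tag_sets)
print(set_counts)
```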
The first kind of aggregation you asked for is inconsistent, which makes it hard to answer. For post_ID 1 you are asking for 1,2, 1,3 and 2,3, without counting the combination of an element with itself (1,1, 2,2, etc.). Yet for post_ID 3, you do say you want 1-1x, which is not a combination of tags. If the latter is an error, you could just do this, even if it's not very elegant:
First, get the combinations for each post_ID:
import itertools
combs_df = df.groupby('post_ID').tag_ID\
.apply(lambda x: list(itertools.combinations(x.tolist(), 2)))
combs_df
# post_ID
# 1 [(1, 2), (1, 3), (2, 3)]
# 2 [(1, 4), (1, 6), (4, 6)]
# 3 []
# 4 [(5, 6)]
# Name: tag_ID, dtype: object
Now, you flatten them by extending a single list with each row's list:
combs_lst = []
combs_df.apply(lambda x: combs_lst.extend(x))
combs_lst
# [(1, 2), (1, 3), (2, 3), (1, 4), (1, 6), (4, 6), (5, 6)]
Now, it's trivial to turn the list into a pandas Series and call value_counts:
pd.Series(combs_lst).value_counts()
# (1, 4) 1
# (5, 6) 1
# (1, 6) 1
# (4, 6) 1
# (2, 3) 1
# (1, 3) 1
# (1, 2) 1
# dtype: int64
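To get closer to the frame layout shown in the question (tag_ID_combinations, n), the three steps can be chained. The sketch below is one way to do it under some assumptions: pairs are formatted as strings like "1,2", tags are sorted within each post, and Series.explode (pandas 0.25 or later) is available. The variable names pairs and result are illustrative only.

```python
import itertools

import pandas as pd

df = pd.DataFrame({"post_ID": [1, 1, 1, 2, 2, 2, 3, 4, 4],
                   "tag_ID": [1, 2, 3, 1, 4, 6, 1, 5, 6]})


def pair_labels(tags):
    """Return every 2-tag combination of one post as a string like '1,2'."""
    combos = itertools.combinations(sorted(tags.tolist()), 2)
    return [",".join(map(str, pair)) for pair in combos]


pairs = (
    df.groupby("post_ID")["tag_ID"]
    .apply(pair_labels)
    .explode()   # one row per pair; posts without any pair become NaN
    .dropna()
)

result = (
    pairs.value_counts()
    .rename_axis("tag_ID_combinations")
    .reset_index(name="n")
)
print(result)
```

With the question's sample data every pair occurs once, so all values in n are 1.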