Find all matching groups in a list of lists with pandas

I would like to find all cases for all ids in a Pandas DataFrame. What would be an efficient solution? I have around 10k records, and they are processed server-side. Would it be a good idea to create a new DataFrame, or is there a more efficient data structure I can use? A case is satisfied when an id contains all of the names in that case.

Input (Pandas DataFrame)

id | name |
-----------
1  | bla1 |
2  | bla2 |
2  | bla3 |
2  | bla4 |
3  | bla5 |
4  | bla9 |
5  | bla6 |
5  | bla7 |
6  | bla8 |

Cases

names = [
  ['bla2', 'bla3', 'bla4'], # case 1
  ['bla1', 'bla3', 'bla7'], # case 2
  ['bla3', 'bla1', 'bla6'], # case 3
  ['bla6', 'bla7']          # case 4
]

Needed output (unless there is a more efficient way)

id | case1 | case2 | case3 | case4 |
------------------------------------
1  | 0     | 0     | 0     | 0     |
2  | 1     | 0     | 0     | 0     |
3  | 0     | 0     | 0     | 0     |
4  | 0     | 0     | 0     | 0     |
5  | 0     | 0     | 0     | 1     |
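6  | 0     | 0     | 0     | 0     |

import pandas as pd

names = [
   ['bla2', 'bla3', 'bla4'], # case 1
   ['bla1', 'bla3', 'bla7'], # case 2
   ['bla3', 'bla1', 'bla6'], # case 3
   ['bla6', 'bla7']          # case 4
]

# df is the input DataFrame from the question (columns: id, name).
# For each id group, emit 1 for a case if every name in that case occurs in the group, else 0.
df = (
    df.groupby('id')
      .apply(lambda x: pd.Series([int(pd.Series(y).isin(x['name']).all()) for y in names]))
      .rename(columns=lambda x: 'case{}'.format(x + 1))
)

df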
+------+---------+---------+---------+---------+
|   id |   case1 |   case2 |   case3 |   case4 |
|------+---------+---------+---------+---------|
|    1 |       0 |       0 |       0 |       0 |
|    2 |       1 |       0 |       0 |       0 |
|    3 |       0 |       0 |       0 |       0 |
|    4 |       0 |       0 |       0 |       0 |
|    5 |       0 |       0 |       0 |       1 |
|    6 |       0 |       0 |       0 |       0 |
+------+---------+---------+---------+---------+

First, groupby id, and then apply a check for each case to each group. The objective is to check whether every name in a given case appears in the group. This is handled by isin in conjunction with the list comprehension. The outer pd.Series expands the result into separate columns, and df.rename is used to rename the columns.
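
As a rough, self-contained illustration of the check described above, the sketch below rebuilds the question's input, runs the isin test for one group and one case, and then tries a set-based variant. The set-based part is an assumption of this sketch, not the answer's code, and the names group, matches_case1, names_by_id, case_sets and result are made up for illustration. Whether it is actually faster on 10k records would need to be measured.

```python
import pandas as pd

# Input rows copied from the question.
df = pd.DataFrame({
    "id":   [1, 2, 2, 2, 3, 4, 5, 5, 6],
    "name": ["bla1", "bla2", "bla3", "bla4", "bla5",
             "bla9", "bla6", "bla7", "bla8"],
})

names = [
    ["bla2", "bla3", "bla4"],  # case 1
    ["bla1", "bla3", "bla7"],  # case 2
    ["bla3", "bla1", "bla6"],  # case 3
    ["bla6", "bla7"],          # case 4
]

# The answer's check, isolated for one group (id 2) and one case (case 1):
# True only if every name in the case is found among the group's names.
group = df.loc[df["id"] == 2, "name"]
matches_case1 = pd.Series(names[0]).isin(group).all()

# Set-based variant (an assumption of this sketch, not the answer's code):
# collect each id's names into a set once, then test every case with a
# subset comparison.
names_by_id = df.groupby("id")["name"].apply(set)
case_sets = [set(case) for case in names]

result = pd.DataFrame(
    {
        "case{}".format(i): [int(case <= group_names) for group_names in names_by_id]
        for i, case in enumerate(case_sets, start=1)
    },
    index=names_by_id.index,
)

print(result)
```

With the question's data, id 2 is the only match for case 1 and id 5 the only match for case 4, which agrees with the needed output in the question.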

