How to check if all the elements of a list in one pandas column are present in another pandas column
I have lists in one column of a dataframe df1, and I want to check, for each row, whether all elements of that list are present in another column that is in a second dataframe df2.
The two dataframes are something like this:
df1                        df2
 id | members               num   | available
 1  | ['a','b']             one   | ['a','b','c','d','e']
 2  | ['b']                 two   | ['a','b']
 3  | ['a','b','c']         three | ['b','d','e']
I am trying to come up with a method that can tell me, for each row in df1, which rows in df2 have all the elements of members. Maybe something that yields this:
 id | members       | which_cols
 1  | ['a','b']     | ['one','two']
 2  | ['b']         | ['one','two','three']
 3  | ['a','b','c'] | ['one']
I thought converting them into dictionaries like {k: list(v) for k,v in df1.groupby("id")["members"]} and {i: list(j) for i,j in df2.groupby("num")["available"]} might make it more flexible to achieve the desired output, but I still haven't found a method that gets me what I'm looking for.
df2 will have about 300 rows, with the length of available being as large as 25,000. And df1 can be as big as 1M rows, with list length in members up to 15. So I think efficiency will also be important.
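For concreteness, here is a brute-force sketch of the output I'm after, using plain Python sets for the subset check (frame and column names as above; at 1M rows this is presumably too slow, which is why I'm asking):

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3],
                    "members": [["a", "b"], ["b"], ["a", "b", "c"]]})
df2 = pd.DataFrame({"num": ["one", "two", "three"],
                    "available": [["a", "b", "c", "d", "e"],
                                  ["a", "b"],
                                  ["b", "d", "e"]]})

# Precompute each available list as a set, then subset-test every
# members list against every row of df2.
avail = list(zip(df2["num"], df2["available"].map(set)))
df1["which_cols"] = df1["members"].map(
    lambda m: [num for num, s in avail if set(m) <= s])

print(df1["which_cols"].tolist())
# [['one', 'two'], ['one', 'two', 'three'], ['one']]
```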
The core of the problem lies in your data setup. If you do a bit of preprocessing, you can avoid tediously iterating through every list multiple times over.
import pandas as pd

df1 = pd.Series([['a', 'b'], ['b'], ['a', 'b', 'c']], name='members').to_frame()
df2 = pd.Series([['a', 'b', 'c', 'd', 'e'],
                 ['a', 'b'],
                 ['b', 'd', 'e']], name='available').to_frame()
df2.index = ['one', 'two', 'three']
>>> df1
members
0 ['a', 'b']
1 ['b']
2 ['a', 'b', 'c']
>>> df2
available
one    ['a', 'b', 'c', 'd', 'e']
two ['a', 'b']
three ['b', 'd', 'e']
If you one-hot encode your data before working with it, you put yourself at a great advantage for doing subset checks:
# You can do this many ways, but sklearn makes this very easy with:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
df1 = df1.join(pd.DataFrame(mlb.fit_transform(df1.pop('members')),
                            columns=mlb.classes_, index=df1.index))

mlb = MultiLabelBinarizer()
df2 = df2.join(pd.DataFrame(mlb.fit_transform(df2.pop('available')),
                            columns=mlb.classes_, index=df2.index))
>>> df1
a b c
0 1 1 0
1 0 1 0
2 1 1 1
>>> df2
a b c d e
one 1 1 1 1 1
two 1 1 0 0 0
three 0 1 0 1 1
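One caveat worth flagging (not in the original answer): fitting two separate binarizers only works out if every label in members also occurs in available; otherwise df1 ends up with columns that df2 lacks, and the df2[df1.columns] lookup further down raises a KeyError. A defensive variant, sketched here with the same data, fits a single binarizer on the union of labels so both frames share identical columns:

```python
from sklearn.preprocessing import MultiLabelBinarizer
import pandas as pd

members = pd.Series([['a', 'b'], ['b'], ['a', 'b', 'c']])
available = pd.Series([['a', 'b', 'c', 'd', 'e'], ['a', 'b'], ['b', 'd', 'e']],
                      index=['one', 'two', 'three'])

# Fit once on all label lists so both encodings share the same columns.
mlb = MultiLabelBinarizer()
mlb.fit(list(members) + list(available))
enc1 = pd.DataFrame(mlb.transform(members), columns=mlb.classes_)
enc2 = pd.DataFrame(mlb.transform(available), columns=mlb.classes_,
                    index=available.index)
```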
The clever thing about this data format is that now you can subtract df1 from df2, and if none of the resulting values is -1 (a -1 would indicate an element missing from df2), then you add that index to the list. Think of this as overlaying the two dataframes (aligning each resource) and then subtracting. And of course, this can be vectorized:
>>> df1.apply(lambda row: df2.index[((df2[df1.columns] - row) >= 0).all(axis = 1)], axis = 1)
0 Index(['one', 'two'], dtype='object')
1 Index(['one', 'two', 'three'], dtype='object')
2 Index(['one'], dtype='object')
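As a side note, if the row-wise apply above proves too slow at 1M rows, the same subset test can be pushed fully into NumPy: row i of df1 is a subset of row j of df2 exactly when the dot product of df1's row with the complement of df2's row is zero. This is a sketch under the assumption that df1's columns are a subset of df2's (as in the example):

```python
import numpy as np
import pandas as pd

# The one-hot frames as produced above.
df1 = pd.DataFrame([[1, 1, 0], [0, 1, 0], [1, 1, 1]], columns=['a', 'b', 'c'])
df2 = pd.DataFrame([[1, 1, 1, 1, 1], [1, 1, 0, 0, 0], [0, 1, 0, 1, 1]],
                   columns=['a', 'b', 'c', 'd', 'e'],
                   index=['one', 'two', 'three'])

m1 = df1.to_numpy()                 # shape (n1, k)
m2 = df2[df1.columns].to_numpy()    # shape (n2, k), restricted to df1's labels
# A positive entry in m1 @ (1 - m2).T means that df1 row needs a
# label that the corresponding df2 row lacks.
ok = (m1 @ (1 - m2).T) == 0         # (n1, n2) boolean subset matrix
result = [df2.index[row].tolist() for row in ok]
print(result)  # [['one', 'two'], ['one', 'two', 'three'], ['one']]
```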