
How to check if all the elements of a list in one pandas column are present in another pandas column

I have lists in one column of a dataframe df1, and for each row I want to check whether all elements of that list are in a column of a second dataframe, df2.

The two dataframes are something like this:

df1                                          df2

id | members      |                          num  |  available           |
1  |['a','b']     |                          one  | ['a','b','c','d','e']|
2  |['b']         |                          two  | ['a','b']            |
3  |['a','b','c'] |                          three| ['b','d','e']        |

I am trying to come up with a method that tells me, for each row in df1, which rows in df2 contain all elements of members. Maybe something that yields this:


id | members      | which_cols            |                
1  |['a','b']     | ['one','two']         |
2  |['b']         | ['one','two','three'] |                         
3  |['a','b','c'] | ['one']               |                      

I thought converting them into dictionaries, like {k: list(v) for k,v in df1.groupby("id")["members"]} and {i: list(j) for i,j in df2.groupby("num")["available"]}, might make it more flexible to achieve the desired output, but I still haven't found a method to get what I'm looking for.

df2 will have about 300 rows, with available lists as long as 25,000 elements. df1 can be as big as 1M rows, with members lists of up to 15 elements, so I think efficiency will also be important.

The core of the problem lies in your data setup. If you do a bit of preprocessing, you can avoid tediously iterating through every list multiple times.
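For comparison, the direct approach that this preprocessing is meant to improve on can be sketched with plain Python sets. This is only a baseline under assumptions: the frames are rebuilt from the question's example, and the names available_sets and rows_containing are hypothetical helpers, not part of the original post.

```python
import pandas as pd

# Example frames rebuilt from the question (small illustrative data).
df1 = pd.DataFrame({"id": [1, 2, 3],
                    "members": [["a", "b"], ["b"], ["a", "b", "c"]]})
df2 = pd.DataFrame({"num": ["one", "two", "three"],
                    "available": [["a", "b", "c", "d", "e"],
                                  ["a", "b"],
                                  ["b", "d", "e"]]})

# Convert every `available` list to a set once, so membership tests are O(1).
available_sets = {num: set(items)
                  for num, items in zip(df2["num"], df2["available"])}

def rows_containing(members):
    """Return the df2 labels whose `available` set contains all of `members`."""
    wanted = set(members)
    return [num for num, items in available_sets.items() if wanted <= items]

df1["which_cols"] = df1["members"].apply(rows_containing)
print(df1)
```

This still loops over every df2 row for every df1 row in Python, which is the repeated iteration the encoding described next tries to avoid.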

Setup

import pandas as pd

df1 = pd.Series([['a', 'b'], ['b'], ['a', 'b', 'c']], name='members').to_frame()
df2 = pd.Series([['a', 'b', 'c', 'd', 'e'],
                 ['a', 'b'],
                 ['b', 'd', 'e']], name='available').to_frame()
df2.index = ['one', 'two', 'three']

>>> df1

    members
0   ['a', 'b']
1   ['b']
2   ['a', 'b', 'c']

>>> df2

        available
one     ['a', 'b', 'c', 'd', 'e']
two     ['a', 'b']
three   ['b', 'd', 'e']

Reshape Data

If you one-hot encode your data before working with it, you put yourself at a great advantage for doing subset checks:

# You can do this many ways, but sklearn makes this very easy with:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
df1 = df1.join(pd.DataFrame(mlb.fit_transform(df1.pop('members')),
                          columns=mlb.classes_, index=df1.index))

mlb = MultiLabelBinarizer()
df2 = df2.join(pd.DataFrame(mlb.fit_transform(df2.pop('available')),
                          columns=mlb.classes_, index=df2.index))

>>> df1
    a   b   c
0   1   1   0
1   0   1   0
2   1   1   1


>>> df2
        a   b   c   d   e
one     1   1   1   1   1
two     1   1   0   0   0
three   0   1   0   1   1
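One caveat with encoding each frame separately is that the two results only share columns for elements that occur in both. The sketch below is a pandas-only alternative that encodes both columns against a shared vocabulary, so an element appearing only in members still gets an all-zero column on the available side. The helper one_hot, the names enc1, enc2 and vocabulary, and the extra element 'z' are assumptions for illustration; it also assumes each Series has a unique index.

```python
import pandas as pd

# 'z' is deliberately absent from every `available` list.
members = pd.Series([["a", "b"], ["b"], ["a", "b", "z"]], name="members")
available = pd.Series(
    [["a", "b", "c", "d", "e"], ["a", "b"], ["b", "d", "e"]],
    index=["one", "two", "three"],
    name="available",
)

# Shared, sorted vocabulary covering both columns.
vocabulary = sorted(set(members.explode().dropna())
                    | set(available.explode().dropna()))

def one_hot(lists, vocabulary):
    """One-hot encode a Series of lists against a fixed vocabulary."""
    # explode() gives one row per element; get_dummies() marks each element.
    dummies = pd.get_dummies(lists.explode(), dtype=int)
    # Collapse back to one row per original index label.
    encoded = dummies.groupby(level=0).max()
    # Restore the original row order and add any missing vocabulary columns.
    return encoded.reindex(index=lists.index, columns=vocabulary, fill_value=0)

enc1 = one_hot(members, vocabulary)
enc2 = one_hot(available, vocabulary)
print(enc1)
print(enc2)
```

With identical columns on both sides, a column selection such as enc2[enc1.columns] cannot fail on an element that is missing from enc2.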

Calculation

The clever thing about this data format is that you can now subtract a row of df1 from df2: if none of the resulting values in a df2 row is -1 (which would indicate an element missing from that df2 row), you add that row's label to the list. Think of this as overlaying the two dataframes (aligning each resource) and then subtracting. And of course, the check against every row of df2 can be vectorized:

>>> df1.apply(lambda row: df2.index[((df2[df1.columns] - row) >= 0).all(axis = 1)], axis = 1)

0   Index(['one', 'two'], dtype='object')
1   Index(['one', 'two', 'three'], dtype='object')
2   Index(['one'], dtype='object')
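The same subset test can also be phrased as a matrix product, which removes the per-row apply. This is a sketch rather than the answer's method: a df1 row is contained in a df2 row exactly when the number of shared elements equals the number of elements in the df1 row. The names enc1, enc2, overlap and contained are hypothetical, and the encoded frames are typed in by hand to match the example above.

```python
import pandas as pd

columns = list("abcde")
# One-hot frames matching the example, padded to the same columns.
enc1 = pd.DataFrame([[1, 1, 0, 0, 0],
                     [0, 1, 0, 0, 0],
                     [1, 1, 1, 0, 0]], columns=columns)
enc2 = pd.DataFrame([[1, 1, 1, 1, 1],
                     [1, 1, 0, 0, 0],
                     [0, 1, 0, 1, 1]],
                    columns=columns, index=["one", "two", "three"])

A = enc1.to_numpy()
B = enc2.to_numpy()

# overlap[i, j] = how many members of df1 row i are available in df2 row j.
overlap = A @ B.T
# Row i is fully contained in row j when the overlap equals its member count.
contained = overlap == A.sum(axis=1, keepdims=True)

labels = enc2.index.to_numpy()
which_cols = pd.Series([labels[mask].tolist() for mask in contained],
                       index=enc1.index, name="which_cols")
print(which_cols)
```

At the sizes mentioned in the question, a dense one-hot matrix for df1 would probably not fit in memory, so a sparse representation (for example scipy.sparse) may be needed; that is an assumption and was not measured here.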


