
How to use pandas isin for multiple columns


I want to find the rows where the (col1, col2) pair from the first dataframe also appears in the second dataframe.

These rows should be in the result dataframe:

  1. pizza, boy

  2. pizza, girl

  3. ice cream, boy

because all three rows are in both the first and the second dataframe.

How can I accomplish this? I was thinking of using isin, but I am not sure how to use it when I have to consider more than one column.

Perform an inner merge on col1 and col2:

import pandas as pd
df1 = pd.DataFrame({'col1': ['pizza', 'hamburger', 'hamburger', 'pizza', 'ice cream'], 'col2': ['boy', 'boy', 'girl', 'girl', 'boy']}, index=range(1,6))
df2 = pd.DataFrame({'col1': ['pizza', 'pizza', 'chicken', 'cake', 'cake', 'chicken', 'ice cream'], 'col2': ['boy', 'girl', 'girl', 'boy', 'girl', 'boy', 'boy']}, index=range(10,17))

print(pd.merge(df2.reset_index(), df1, how='inner').set_index('index'))

yields

            col1  col2
index                 
10         pizza   boy
11         pizza  girl
16     ice cream   boy

The reset_index and set_index calls preserve df2's index, as in the desired result you posted. If the index is not important, then

pd.merge(df2, df1, how='inner')
#         col1  col2
# 0      pizza   boy
# 1      pizza  girl
# 2  ice cream   boy

would suffice.
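One thing to watch with the merge approach: if df1 contains the same (col1, col2) pair more than once, an inner merge repeats the matching df2 rows. A minimal sketch of one way around that, assuming you want each df2 row at most once, is to deduplicate df1's key columns first and name the keys explicitly with on=. The small frames and the names keys, naive and result are made up for illustration.

```python
import pandas as pd

# ('pizza', 'boy') appears twice in df1 on purpose
df1 = pd.DataFrame({'col1': ['pizza', 'pizza', 'hamburger'],
                    'col2': ['boy', 'boy', 'girl']})
df2 = pd.DataFrame({'col1': ['pizza', 'cake'],
                    'col2': ['boy', 'girl']}, index=[10, 11])

keys = ['col1', 'col2']

# Plain inner merge: the df2 row is repeated once per matching df1 row.
naive = pd.merge(df2, df1, how='inner', on=keys)

# Deduplicate df1's key pairs first so each df2 row appears at most once,
# and carry df2's index through the merge as before.
result = pd.merge(df2.reset_index(), df1[keys].drop_duplicates(),
                  how='inner', on=keys).set_index('index')

print(len(naive), len(result))
```

Here naive has two rows for the single matching df2 row, while result has one.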


Alternatively, you could construct MultiIndexes out of the col1 and col2 columns, and then call the MultiIndex.isin method:

index1 = pd.MultiIndex.from_arrays([df1[col] for col in ['col1', 'col2']])
index2 = pd.MultiIndex.from_arrays([df2[col] for col in ['col1', 'col2']])
print(df2.loc[index2.isin(index1)])

yields

         col1  col2
10      pizza   boy
11      pizza  girl
16  ice cream   boy
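The MultiIndex approach can be wrapped in a small helper that returns a boolean mask, which also makes the negated case (rows not present in the other frame) easy. This is only a sketch: rows_in is a hypothetical helper name, and it uses MultiIndex.from_frame instead of from_arrays to build the index straight from the selected columns.

```python
import pandas as pd

def rows_in(df, other, cols):
    """Boolean mask: True where df's values in `cols` occur as a row of `other`."""
    left = pd.MultiIndex.from_frame(df[cols])
    right = pd.MultiIndex.from_frame(other[cols])
    return left.isin(right)

df1 = pd.DataFrame({'col1': ['pizza', 'hamburger', 'hamburger', 'pizza', 'ice cream'],
                    'col2': ['boy', 'boy', 'girl', 'girl', 'boy']}, index=range(1, 6))
df2 = pd.DataFrame({'col1': ['pizza', 'pizza', 'chicken', 'cake', 'cake', 'chicken', 'ice cream'],
                    'col2': ['boy', 'girl', 'girl', 'boy', 'girl', 'boy', 'boy']}, index=range(10, 17))

mask = rows_in(df2, df1, ['col1', 'col2'])
matched = df2[mask]      # rows of df2 whose pair also occurs in df1
unmatched = df2[~mask]   # the "not isin" case
```

Because the mask is a plain boolean array, df2's original index is kept without any reset_index or set_index calls.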

Thank you unutbu! Here is a little update.

import pandas as pd
df1 = pd.DataFrame({'col1': ['pizza', 'hamburger', 'hamburger', 'pizza', 'ice cream'], 'col2': ['boy', 'boy', 'girl', 'girl', 'boy']}, index=range(1,6))
df2 = pd.DataFrame({'col1': ['pizza', 'pizza', 'chicken', 'cake', 'cake', 'chicken', 'ice cream'], 'col2': ['boy', 'girl', 'girl', 'boy', 'girl', 'boy', 'boy']}, index=range(10,17))
df1[df1.set_index(['col1','col2']).index.isin(df2.set_index(['col1','col2']).index)]

returns:

    col1    col2
1   pizza   boy
4   pizza   girl
5   ice cream   boy

If for some reason you must stick to isin or its negated form ~isin, you can first create a new column holding the concatenation of col1 and col2, then use isin to filter your data. Here is the code:

import pandas as pd
df1 = pd.DataFrame({'col1': ['pizza', 'hamburger', 'hamburger', 'pizza', 'ice cream'], 'col2': ['boy', 'boy', 'girl', 'girl', 'boy']}, index=range(1,6))
df2 = pd.DataFrame({'col1': ['pizza', 'pizza', 'chicken', 'cake', 'cake', 'chicken', 'ice cream'], 'col2': ['boy', 'girl', 'girl', 'boy', 'girl', 'boy', 'boy']}, index=range(10,17))

df1['indicator'] = df1['col1'].str.cat(df1['col2'])
df2['indicator'] = df2['col1'].str.cat(df2['col2'])

df2.loc[df2['indicator'].isin(df1['indicator'])].drop(columns=['indicator'])

which gives


    col1    col2
10  pizza   boy
11  pizza   girl
16  ice cream   boy

If you do this, make sure that concatenating the two columns cannot create false positives. For example, the concatenation of 123 and 456 in df1 and the concatenation of 12 and 3456 in df2 would match even though their respective columns don't. You can avoid this problem with the sep parameter:

df1['indicator'] = df1['col1'].str.cat(df1['col2'], sep='$$$')
df2['indicator'] = df2['col1'].str.cat(df2['col2'], sep='$$$')
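The false positive described above can be reproduced with a tiny example. The frames a and b are made up for illustration and hold string values, since the .str accessor only works on strings (other columns would need astype(str) first). The separator only helps if it never occurs in the data itself.

```python
import pandas as pd

a = pd.DataFrame({'col1': ['123'], 'col2': ['456']})
b = pd.DataFrame({'col1': ['12'], 'col2': ['3456']})

# Without a separator both rows collapse to the same string '123456'.
no_sep_a = a['col1'].str.cat(a['col2'])
no_sep_b = b['col1'].str.cat(b['col2'])
false_match = no_sep_b.isin(no_sep_a)

# With a separator the strings differ: '123$$$456' vs '12$$$3456'.
sep_a = a['col1'].str.cat(a['col2'], sep='$$$')
sep_b = b['col1'].str.cat(b['col2'], sep='$$$')
with_sep = sep_b.isin(sep_a)

print(false_match.tolist(), with_sep.tolist())
```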

One possible way is to define a check function of your own and apply it to the dataframe.

For example, if you know the list of combinations that need to be filtered (this list can be extracted beforehand from a dataframe):

filter_list_multicols = [["book", "cat"], ["table", "dog"], ["table", "cat"], ["pen", "horse"], ["book", "horse"]]

Then you could define a check function like so:

def isin_multicols_check(stationary_name, animal_name):
    # Return True if the (stationary_name, animal_name) pair is in filter_list_multicols.
    for filter_pair in filter_list_multicols:
        if (stationary_name == filter_pair[0]) and (animal_name == filter_pair[1]):
            return True

    return False

Example dataframe:

df = pd.DataFrame([
                   [1, "book", "dog"], [2, "pen", "dog"], [3, "pen", "rat"], [4, "book", "horse"], [5, "book", "cat"]
                  ],
                   columns=["S.N.", "stationary_name", "animal_name"])
df
S.N.    stationary_name  animal_name
1           book            dog
2           pen             dog
3           pen             rat
4           book            horse
5           book            cat

And now, call the function using pandas apply:

df["is_in"] = df.apply(lambda x: isin_multicols_check(x.stationary_name, x.animal_name), axis=1)
df
S.N.    stationary_name  animal_name    is_in
1           book            dog         False
2           pen             dog         False
3           pen             rat         False
4           book            horse       True
5           book            cat         True

The result:

is_in = df[df["is_in"]]
not_is_in = df[~df["is_in"]]
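Row-wise apply calls the Python function once per row and scans the whole filter list each time, which can be slow on large frames. A possible alternative, sketched here with the same example data, is to turn the filter list into a set of tuples and test each (stationary_name, animal_name) pair for membership. The name allowed is made up for this sketch.

```python
import pandas as pd

filter_list_multicols = [["book", "cat"], ["table", "dog"], ["table", "cat"],
                         ["pen", "horse"], ["book", "horse"]]

df = pd.DataFrame(
    [[1, "book", "dog"], [2, "pen", "dog"], [3, "pen", "rat"],
     [4, "book", "horse"], [5, "book", "cat"]],
    columns=["S.N.", "stationary_name", "animal_name"])

# Lists are not hashable, so convert each pair to a tuple before building the set.
allowed = {tuple(pair) for pair in filter_list_multicols}

# One set lookup per row instead of a scan over the whole filter list.
df["is_in"] = [pair in allowed
               for pair in zip(df["stationary_name"], df["animal_name"])]

is_in = df[df["is_in"]]
not_is_in = df[~df["is_in"]]
```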

Another way is to pass a dict to isin(), which tests each column against its own list of values, as https://www.oreilly.com/library/view/mastering-exploratory-analysis/9781789619638/eb563c9a-83e1-4e0c-82d7-6f83addc3340.xhtml suggests.

The documentation https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isin.html also shows an example of how to pass a dictionary.
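Keep in mind that with a dict, DataFrame.isin checks every column independently against its own values, so it does not by itself require the values to occur together in the same row of the other frame. The sketch below uses small made-up frames to show the difference from the pairwise MultiIndex check. The names per_column and pairwise are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['pizza', 'ice cream'],
                    'col2': ['girl', 'boy']})
df2 = pd.DataFrame({'col1': ['pizza', 'ice cream', 'ice cream'],
                    'col2': ['girl', 'girl', 'boy']}, index=[10, 11, 12])

# Dict form: col1 is tested against df1's col1 values and col2 against
# df1's col2 values, each column on its own.
mask = df2.isin({'col1': df1['col1'].tolist(), 'col2': df1['col2'].tolist()})
per_column = df2[mask.all(axis=1)]

# Pairwise form: the (col1, col2) combination must exist as a row of df1.
cols = ['col1', 'col2']
pair_mask = pd.MultiIndex.from_frame(df2[cols]).isin(pd.MultiIndex.from_frame(df1[cols]))
pairwise = df2[pair_mask]

print(per_column.index.tolist(), pairwise.index.tolist())
```

Row 11 ('ice cream', 'girl') passes the per-column check because both values occur somewhere in df1, but it is not a row of df1, so the pairwise check excludes it.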
