Look up a specific value for each row of a DataFrame in a column of another DataFrame

Question

I am looking for ways to replace functions I use in Excel with Python, especially with Pandas. One of these functions is COUNTIFS(), which I mainly use to locate specific row values in a fixed range, that is, to determine whether the values in one column are present in another column.

An example in Excel would look something like this:

The code for the first row (column: col1_in_col2):

=COUNTIFS($B$2:$B$6,A2)

I have tried to recreate the function in Pandas, with the only difference being that the two columns are in two different DataFrames, and the DataFrames are inside a dictionary (bigdict). The code is the following:

import pandas as pd

bigdict = {"df1": pd.DataFrame({"col1": ["0110200_2016", "011037_2016", "011037_2016", "0111054_2016"]}), "df2": pd.DataFrame({"col1" : ["011037_2016", "0111054_2016", "011109_2016", "0111268_2016"]})}

bigdict.get("df1")["df1_in_df2"] = bigdict.get("df1").apply(lambda x: 1 if x["col1"] in bigdict.get("df2")["col1"] else 0, axis=1)

In the example above, the first row should get a return value of 0, while the other rows should get a return value of 1, since their values can be found in the other DataFrame's column. However, the return value is 0 for every row.

Answer 1

Try this. I unpacked your dictionary into two DataFrames and compared their values.

import numpy as np
import pandas as pd

df1 = pd.DataFrame({"col1": ["0110200_2016", "011037_2016", "011037_2016", "0111054_2016"]})
df2 = pd.DataFrame({"col1" : ["011037_2016", "0111054_2016", "011109_2016", "0111268_2016"]})

df1['df1_in_df2'] = np.where(df1.iloc[:,0].isin(list(df2.iloc[:,0])),1,0)

Answer 2

Here is a way to do it using a list comprehension:

bigdict['df1']['df1_in_df2'] = [1 if x in bigdict['df2']['col1'].values else 0
                                for x in bigdict['df1']['col1']]

Output:

           col1  df1_in_df2
0  0110200_2016           0
1   011037_2016           1
2   011037_2016           1
3  0111054_2016           1
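
The `.values` in the comprehension above appears to be what separates it from the code in the question: on a pandas Series, the `in` operator tests the index labels, not the stored values. The following is a minimal sketch of that behavior; the Series `s` and the variable names are illustrative and do not come from the original post.

```python
import pandas as pd

s = pd.Series(["011037_2016", "0111054_2016"])

# `in` on a Series looks at the index labels (here 0 and 1), not the values.
in_series = "011037_2016" in s
label_in_series = 0 in s

# Testing against the underlying array looks at the values.
in_values = "011037_2016" in s.values

# isin() performs the value test element-wise and returns a boolean Series.
mask = s.isin(["011037_2016"]).tolist()

print(in_series, label_in_series, in_values, mask)
# False True True [True, False]
```

This would explain why the question's lambda returns 0 for every row: none of the strings is an index label of `bigdict.get("df2")["col1"]`.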

Answer 3

This is basically the same as @Ashwini's answer, but it gets rid of np.where and iloc, which could make it more readable and possibly faster.

import pandas as pd

df1 = pd.DataFrame({"col1": ["0110200_2016", "011037_2016",
                             "011037_2016", "0111054_2016"]})

df2 = pd.DataFrame({"col1" : ["011037_2016", "0111054_2016",
                              "011109_2016", "0111268_2016"]})

df1['df1_in_df2'] = df1["col1"].isin(df2['col1'].values).astype("int8")

UPDATE

Timing

Here I compare four methods: those of @vlemaistre, @Ashwini and @SamLegesse, and mine.

import pandas as pd
import numpy as np

# create fake data
n = int(1e6)
n1 = int(1e4)

df = pd.DataFrame()
df["col1"] = ["{:012}".format(i) for i in range(n)]

df2 = df.sample(n1)
toRemove = df2.sample(n1//2).index
df1 = df[~df.index.isin(toRemove)].sample(frac=1).reset_index(drop=True)
df2 = df2.reset_index(drop=True)

# backup dataframe
df0 = df1.copy()

vlemaistre

bigdict = {"df1": df1, "df2": df2}

%%time
bigdict['df1']['df1_in_df2'] = [1 if x in bigdict['df2']['col1'].values else 0
                                for x in bigdict['df1']['col1']]

CPU times: user 4min 53s, sys: 3.08 s, total: 4min 56s
Wall time: 4min 41s

SamLegesse

def countif(x,col):
    # Return 1 if x occurs among the values of col, otherwise 0
    if x in col.values:
        return 1
    else:
        return 0

df1 = df0.copy()

%%time
df1['df1_in_df2'] = df1['col1'].apply(countif, col=df2['col1'])

CPU times: user 4min 48s, sys: 2.66 s, total: 4min 50s
Wall time: 4min 38s

Ashwini

df1 = df0.copy()

%%time
df1['df1_in_df2'] = np.where(df1.iloc[:,0].isin(list(df2.iloc[:,0])),1,0)

CPU times: user 167 ms, sys: 0 ns, total: 167 ms
Wall time: 165 ms

rpanai

This is on par with Ashwini's solution.

df1 = df0.copy()

%%time
df1['df1_in_df2'] = df1["col1"].isin(df2['col1'].values).astype("int8")

CPU times: user 152 ms, sys: 0 ns, total: 152 ms
Wall time: 150 ms

Conclusion

The vectorized methods are (at least) 1684x faster than the one using apply (4 min 38 s versus 165 ms wall time).
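
The two slow variants both evaluate `x in col.values` once per row, which scans the whole array each time. A possible middle ground, not part of the benchmark above and not timed here, is to build a Python set once and test membership against it. This is a sketch that assumes the lookup values are hashable; the name `lookup` is introduced only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": ["0110200_2016", "011037_2016",
                             "011037_2016", "0111054_2016"]})
df2 = pd.DataFrame({"col1": ["011037_2016", "0111054_2016",
                             "011109_2016", "0111268_2016"]})

# Build the set once; each membership test is then a hash lookup
# instead of a scan over every element of df2["col1"].
lookup = set(df2["col1"])

df1["df1_in_df2"] = [1 if x in lookup else 0 for x in df1["col1"]]

print(df1["df1_in_df2"].tolist())
# [0, 1, 1, 1]
```

The isin-based versions remain the more idiomatic choice; the set only shows where the per-row cost of the loop-based versions comes from.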

Answer 4

The easiest way, in my opinion, would be to write a generic function that you can apply any time you want to do the equivalent of an Excel COUNTIF().

import pandas as pd

def countif(x,col):
    # Return 1 if x occurs among the values of col, otherwise 0
    if x in col.values:
        return 1
    else:
        return 0

df1 = pd.DataFrame({"col1": ["0110200_2016", "011037_2016", "011037_2016", "0111054_2016"]})
df2 = pd.DataFrame({"col1" : ["011037_2016", "0111054_2016", "011109_2016", "0111268_2016"]})

df1['df1_in_df2'] = df1['col1'].apply(countif, col=df2['col1'])

EDIT:

As rpanai mentioned in the comments, apply is known to have performance issues as your data grows. Using numpy vectorization would provide a large performance boost. Here is a modified version of Ashwini's answer.

import pandas as pd
import numpy as np

def countif(df1, df2, col1, col2, name):
    df1[name] = np.where(df1[col1].isin(list(df2[col2])),1,0)

df1 = pd.DataFrame({"col1": ["0110200_2016", "011037_2016", "011037_2016", "0111054_2016"]})
df2 = pd.DataFrame({"col1" : ["011037_2016", "0111054_2016", "011109_2016", "0111268_2016"]})

countif(df1,df2,'col1','col1','df1_in_df2')

print(df1)
#            col1  df1_in_df2
# 0  0110200_2016           0
# 1   011037_2016           1
# 2   011037_2016           1
# 3  0111054_2016           1
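
All of the versions above return a 0/1 flag, whereas Excel's COUNTIFS returns the number of matches. If the actual count is wanted, one option is to count the values of the lookup column once and map those counts onto the first column. This is a sketch under assumptions: the duplicated entry in `df2` and the column name `count_in_df2` are made up for illustration and are not in the original data.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": ["0110200_2016", "011037_2016",
                             "011037_2016", "0111054_2016"]})
# Hypothetical data: "011037_2016" is duplicated so that one count exceeds 1.
df2 = pd.DataFrame({"col1": ["011037_2016", "0111054_2016",
                             "011109_2016", "011037_2016"]})

# Occurrences of each distinct value in df2["col1"].
counts = df2["col1"].value_counts()

# Look up each df1 value in the counts; values missing from df2 become NaN, then 0.
df1["count_in_df2"] = df1["col1"].map(counts).fillna(0).astype(int)

print(df1["count_in_df2"].tolist())
# [0, 2, 2, 1]
```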
