Find a specific value for each row of a DataFrame in another DataFrame's column

Question

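I am looking for ways to replace the functions I use in Excel with Python, Pandas in particular. One of these functions is COUNTIFS(), which I mainly use to locate specific row values within a fixed range. Mostly, I use it to determine whether a specific value in one column is present in another column.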

An example in Excel looks like this:

Formula for the first row (column: col1_in_col2):

=COUNTIFS($B$2:$B$6, A2)

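I tried to recreate the function in Pandas. The only difference is that the two columns live in two different DataFrames, and the DataFrames are stored in a dictionary (bigdict). The code is as follows: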

import pandas as pd

bigdict = {"df1": pd.DataFrame({"col1": ["0110200_2016", "011037_2016", "011037_2016", "0111054_2016"]}), "df2": pd.DataFrame({"col1" : ["011037_2016", "0111054_2016", "011109_2016", "0111268_2016"]})}

bigdict.get("df1")["df1_in_df2"] = bigdict.get("df1").apply(lambda x: 1 if x["col1"] in bigdict.get("df2")["col1"] else 0, axis=1)

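In the example above, the first row should return 0, while the other rows should return 1, because their values can be found in the other DataFrame's column. However, every row returns 0.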

Answer 1

Try this. I split your dictionary into two DataFrames and compared their values.

import numpy as np
import pandas as pd

df1 = pd.DataFrame({"col1": ["0110200_2016", "011037_2016", "011037_2016", "0111054_2016"]})
df2 = pd.DataFrame({"col1": ["011037_2016", "0111054_2016", "011109_2016", "0111268_2016"]})

# 1 where the df1 value appears anywhere in df2's first column, else 0
df1['df1_in_df2'] = np.where(df1.iloc[:, 0].isin(list(df2.iloc[:, 0])), 1, 0)
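
The same check can also be applied without unpacking the dictionary. A minimal sketch, assuming the bigdict layout from the question; it uses Series.isin directly instead of np.where, which is a small variation on the code above rather than the answerer's exact approach:

```python
import pandas as pd

bigdict = {
    "df1": pd.DataFrame({"col1": ["0110200_2016", "011037_2016",
                                  "011037_2016", "0111054_2016"]}),
    "df2": pd.DataFrame({"col1": ["011037_2016", "0111054_2016",
                                  "011109_2016", "0111268_2016"]}),
}

# isin() compares against the values of df2's column, not its index labels
found = bigdict["df1"]["col1"].isin(bigdict["df2"]["col1"])

# Convert the boolean mask to 0/1 to mimic the Excel-style flag
bigdict["df1"]["df1_in_df2"] = found.astype(int)

print(bigdict["df1"])
```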

Answer 2

Here is a way to do it with a list comprehension:

bigdict['df1']['df1_in_df2'] = [1 if x in bigdict['df2']['col1'].values else 0
                                for x in bigdict['df1']['col1']]

Output:

           col1  df1_in_df2
0  0110200_2016           0
1   011037_2016           1
2   011037_2016           1
3  0111054_2016           1
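
The .values in the comprehension is what makes the difference from the code in the question. The in operator applied to a pandas Series tests the index labels, not the stored values, so every lookup in the original lambda came back False. A small sketch of that behaviour, using a throwaway Series s that is not part of the original thread:

```python
import pandas as pd

s = pd.Series(["011037_2016", "0111054_2016"])

# `in` on a Series looks at the index labels (here 0 and 1)
in_index = "011037_2016" in s

# `in` on the underlying array looks at the stored values
in_values = "011037_2016" in s.values

# An index label is found, even though it is not one of the values
label_hit = 0 in s

print(in_index, in_values, label_hit)
```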

Answer 3

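This is essentially the same as the answer that uses iloc, but it drops np.where and iloc, which makes the code more readable and may make it slightly faster.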

import pandas as pd

df1 = pd.DataFrame({"col1": ["0110200_2016", "011037_2016",
                             "011037_2016", "0111054_2016"]})

df2 = pd.DataFrame({"col1" : ["011037_2016", "0111054_2016",
                              "011109_2016", "0111268_2016"]})

df1['df1_in_df2'] = df1["col1"].isin(df2['col1'].values).astype("int8")

Update

Timing

Here I compare four approaches: those of @vlemaistre, @Ashwini, @SamLegesse, and mine.

import pandas as pd
import numpy as np

# create fake data
n = int(1e6)
n1 = int(1e4)

df = pd.DataFrame()
df["col1"] = ["{:012}".format(i) for i in range(n)]

df2 = df.sample(n1)
toRemove = df2.sample(n1//2).index
df1 = df[~df.index.isin(toRemove)].sample(frac=1).reset_index(drop=True)
df2 = df2.reset_index(drop=True)

# backup dataframe
df0 = df1.copy()

Vlemaistre

bigdict = {"df1": df1, "df2": df2}

%%time
bigdict['df1']['df1_in_df2'] = [1 if x in bigdict['df2']['col1'].values else 0
                                for x in bigdict['df1']['col1']]

CPU times: user 4min 53s, sys: 3.08 s, total: 4min 56s
Wall time: 4min 41s

SamLegesse

def countif(x, col):
    # 1 if x is among the column's values, else 0
    if x in col.values:
        return 1
    else:
        return 0

df1 = df0.copy()

%%time
df1['df1_in_df2'] = df1['col1'].apply(countif, col=df2['col1'])

CPU times: user 4min 48s, sys: 2.66 s, total: 4min 50s
Wall time: 4min 38s

Ashwini

df1 = df0.copy()

%%time
df1['df1_in_df2'] = np.where(df1.iloc[:,0].isin(list(df2.iloc[:,0])),1,0)

CPU times: user 167 ms, sys: 0 ns, total: 167 ms
Wall time: 165 ms

rpanai

This performs about the same as Ashwini's solution.

df1 = df0.copy()

%%time
df1['df1_in_df2'] = df1["col1"].isin(df2['col1'].values).astype("int8")

CPU times: user 152 ms, sys: 0 ns, total: 152 ms
Wall time: 150 ms

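Conclusion

The vectorized methods are at least 1684 times faster than the apply and list-comprehension methods (about 278 s versus 0.165 s of wall time in the runs above).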
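
The loop-based versions are slow mainly because x in col.values scans the whole array for every row, so the work grows with len(df1) × len(df2). As an illustration only, not one of the timed methods, a plain Python set gives hash-based lookups inside the same comprehension and yields the same flags as isin. The names lookup, via_set and via_isin are made up for this sketch, and no timings are claimed for it:

```python
import pandas as pd

df1 = pd.DataFrame({"col1": ["0110200_2016", "011037_2016",
                             "011037_2016", "0111054_2016"]})
df2 = pd.DataFrame({"col1": ["011037_2016", "0111054_2016",
                             "011109_2016", "0111268_2016"]})

# Build the lookup once; membership tests on a set do not rescan the data
lookup = set(df2["col1"])
via_set = [1 if x in lookup else 0 for x in df1["col1"]]

# Vectorized equivalent, as used in the faster answers
via_isin = df1["col1"].isin(df2["col1"]).astype(int).tolist()

print(via_set, via_isin)
```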

Answer 4

In my opinion, the simplest approach is to write a general-purpose function that you can apply whenever you want the equivalent of Excel's COUNTIF().

import pandas as pd

def countif(x, col):
    # 1 if x is among the column's values, else 0
    if x in col.values:
        return 1
    else:
        return 0

df1 = pd.DataFrame({"col1": ["0110200_2016", "011037_2016", "011037_2016", "0111054_2016"]})
df2 = pd.DataFrame({"col1" : ["011037_2016", "0111054_2016", "011109_2016", "0111268_2016"]})

df1['df1_in_df2'] = df1['col1'].apply(countif, col=df2['col1'])

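Edit:

As rpanai mentioned in the comments, apply has performance problems as the data grows. Vectorizing with numpy improves performance considerably. Here is a modified version of Ashwini's answer.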

import pandas as pd
import numpy as np

def countif(df1, df2, col1, col2, name):
    # Adds column `name` to df1 in place: 1 if df1[col1] occurs in df2[col2], else 0
    df1[name] = np.where(df1[col1].isin(list(df2[col2])), 1, 0)

df1 = pd.DataFrame({"col1": ["0110200_2016", "011037_2016", "011037_2016", "0111054_2016"]})
df2 = pd.DataFrame({"col1" : ["011037_2016", "0111054_2016", "011109_2016", "0111268_2016"]})

countif(df1,df2,'col1','col1','df1_in_df2')

print(df1)
#            col1  df1_in_df2
# 0  0110200_2016           0
# 1   011037_2016           1
# 2   011037_2016           1
# 3  0111054_2016           1
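
All the answers here return a 0/1 flag, whereas Excel's COUNTIFS returns the number of matches. If the actual count is needed, one possible sketch is to count the values in the second column once with value_counts and map those counts onto the first column. The sample data and the column name count_in_df2 below are hypothetical, chosen so that one value occurs twice:

```python
import pandas as pd

# Hypothetical data: "c" appears twice in df2, "a" not at all
df1 = pd.DataFrame({"col1": ["a", "b", "b", "c"]})
df2 = pd.DataFrame({"col1": ["b", "c", "c", "d"]})

# Number of occurrences of each distinct value in df2's column
counts = df2["col1"].value_counts()

# Look up each df1 value; values absent from df2 become NaN, then 0
df1["count_in_df2"] = df1["col1"].map(counts).fillna(0).astype(int)

print(df1)
```

Comparing the result with 0 (df1["count_in_df2"] > 0) gives back the same presence flag as the isin-based answers.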

4 answers

Answer 1: score 3, posted 2019-08-27 12:58:59
Answer 2: score 1, posted 2019-08-27 13:00:09
Answer 3: score 1, accepted, posted 2019-08-27 13:04:39
Answer 4: score 0, posted 2019-08-27 13:09:56
