Finding a specific value for each row in a Dataframe from another Dataframe's column

I am looking for ways to replace functions I use in Excel with Python, especially with Pandas. One of them is COUNTIFS(), which I have mainly used to locate specific row values in a fixed range, that is, to determine whether the values in one column are present in another column.

An example in Excel would look something like this. The formula for the first row (column: col1_in_col2) is:

=COUNTIFS($B$2:$B$6,A2)

I have tried to recreate this in Pandas, with the difference that the two columns live in two different DataFrames, and the DataFrames are stored in a dictionary (bigdict). The code is the following:

import pandas as pd

bigdict = {"df1": pd.DataFrame({"col1": ["0110200_2016", "011037_2016", "011037_2016", "0111054_2016"]}), "df2": pd.DataFrame({"col1" : ["011037_2016", "0111054_2016", "011109_2016", "0111268_2016"]})}

bigdict.get("df1")["df1_in_df2"] = bigdict.get("df1").apply(lambda x: 1 if x["col1"] in bigdict.get("df2")["col1"] else 0, axis=1)

In the example above, the first row should get a return value of 0, while the other rows should get 1, since their values can be found in the other DataFrame's column. However, the return value is 0 for every row.

Try this. I unpacked your dictionary into two DataFrames and compared their values.

import numpy as np
import pandas as pd

df1 = pd.DataFrame({"col1": ["0110200_2016", "011037_2016", "011037_2016", "0111054_2016"]})
df2 = pd.DataFrame({"col1" : ["011037_2016", "0111054_2016", "011109_2016", "0111268_2016"]})

# 1 where the value in df1's first column also occurs in df2's first column, else 0
df1['df1_in_df2'] = np.where(df1.iloc[:,0].isin(list(df2.iloc[:,0])),1,0)

Here is a way to do it using a list comprehension:

bigdict['df1']['df1_in_df2'] = [1 if x in bigdict['df2']['col1'].values else 0
                                for x in bigdict['df1']['col1']]

Output:

           col1  df1_in_df2
0  0110200_2016           0
1   011037_2016           1
2   011037_2016           1
3  0111054_2016           1
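
The `.values` in the comprehension above matters. As far as I can tell, the original `apply` returned 0 everywhere because `x in some_series` tests the index labels of the Series, not its data. A minimal sketch of that behaviour, using a small made-up Series `s` (not part of the question's data structure):

```python
import pandas as pd

s = pd.Series(["011037_2016", "0111054_2016"])

# `in` on a Series looks at the index labels (here 0 and 1), not the data.
label_hit = 0 in s                      # 0 is an index label
value_miss = "011037_2016" in s         # the string is a value, not a label

# Testing against the underlying array (or using isin) checks the values.
value_hit = "011037_2016" in s.values
isin_flags = s.isin(["011037_2016"]).tolist()

print(label_hit, value_miss, value_hit, isin_flags)
```

Here `label_hit` and `value_hit` come out true while `value_miss` is false, which matches the all-zero column seen in the question.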

This is basically the same as @Ashwini's answer, but it drops np.where and iloc, which could make it more readable and possibly faster.

import pandas as pd

df1 = pd.DataFrame({"col1": ["0110200_2016", "011037_2016",
                             "011037_2016", "0111054_2016"]})

df2 = pd.DataFrame({"col1" : ["011037_2016", "0111054_2016",
                              "011109_2016", "0111268_2016"]})

df1['df1_in_df2'] = df1["col1"].isin(df2['col1'].values).astype("int8")

UPDATE

Timing

Here I compare 4 methods: those of @vlemaistre, @Ashwini, @SamLegesse, and mine.

import pandas as pd
import numpy as np

# create fake data
n = int(1e6)
n1 = int(1e4)

df = pd.DataFrame()
df["col1"] = ["{:012}".format(i) for i in range(n)]

df2 = df.sample(n1)
toRemove = df2.sample(n1//2).index
df1 = df[~df.index.isin(toRemove)].sample(frac=1).reset_index(drop=True)
df2 = df2.reset_index(drop=True)

# backup dataframe
df0 = df1.copy()

vlemaistre

bigdict = {"df1": df1, "df2": df2}
%%time
bigdict['df1']['df1_in_df2'] = [1 if x in bigdict['df2']['col1'].values else 0
                                for x in bigdict['df1']['col1']]
CPU times: user 4min 53s, sys: 3.08 s, total: 4min 56s
Wall time: 4min 41s

SamLegesse

def countif(x,col):
    # 1 if x occurs among the values of col, else 0
    if x in col.values:
        return 1
    else:
        return 0
df1 = df0.copy()

%%time
df1['df1_in_df2'] = df1['col1'].apply(countif, col=df2['col1'])
CPU times: user 4min 48s, sys: 2.66 s, total: 4min 50s
Wall time: 4min 38s

Ashwini

df1 = df0.copy()
%%time
df1['df1_in_df2'] = np.where(df1.iloc[:,0].isin(list(df2.iloc[:,0])),1,0)
CPU times: user 167 ms, sys: 0 ns, total: 167 ms
Wall time: 165 ms

rpanai

This is on par with Ashwini's solution.

df1 = df0.copy()
%%time
df1['df1_in_df2'] = df1["col1"].isin(df2['col1'].values).astype("int8")
CPU times: user 152 ms, sys: 0 ns, total: 152 ms
Wall time: 150 ms

Conclusion

On this data, the vectorized methods are (at least) 1684x faster than the one using apply.
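
The timings above come from Jupyter's %%time. A rough sketch of the same comparison as a plain script is below. The sizes `n` and `n_lookup` and the helper `time_call` are made up for illustration and are kept small so the slow path finishes quickly; the absolute numbers will differ from the ones reported above.

```python
import time

import pandas as pd

# Small synthetic data (sizes are arbitrary assumptions, not the ones used above).
n, n_lookup = 5_000, 500
df1 = pd.DataFrame({"col1": ["{:012}".format(i) for i in range(n)]})
df2 = df1.sample(n_lookup, random_state=0).reset_index(drop=True)

def time_call(func):
    """Hypothetical helper: return (result, elapsed seconds) for one call of func."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start

lookup = df2["col1"].values

# Row-by-row membership test via apply.
slow, t_slow = time_call(lambda: df1["col1"].apply(lambda x: 1 if x in lookup else 0))

# Vectorized membership test via isin.
fast, t_fast = time_call(lambda: df1["col1"].isin(lookup).astype("int8"))

print(f"apply: {t_slow:.4f}s, isin: {t_fast:.4f}s")

# Both approaches should flag exactly the same rows.
same = bool((slow == fast).all())
```

A single call per method is noisy; for anything more careful, the standard library's `timeit` module repeats the measurement.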

The easiest way, in my opinion, is to write a generic function that you can apply any time you want the equivalent of an Excel countif().

import pandas as pd

def countif(x,col):
    # 1 if x occurs among the values of col, else 0
    if x in col.values:
        return 1
    else:
        return 0

df1 = pd.DataFrame({"col1": ["0110200_2016", "011037_2016", "011037_2016", "0111054_2016"]})
df2 = pd.DataFrame({"col1" : ["011037_2016", "0111054_2016", "011109_2016", "0111268_2016"]})

df1['df1_in_df2'] = df1['col1'].apply(countif, col=df2['col1'])

EDIT:

As rpanai mentioned in the comments, apply is known to have performance issues as your data grows. Using numpy vectorization provides a large performance boost. Here is a modified version of Ashwini's answer.

import pandas as pd
import numpy as np

def countif(df1, df2, col1, col2, name):
    df1[name] = np.where(df1[col1].isin(list(df2[col2])),1,0)

df1 = pd.DataFrame({"col1": ["0110200_2016", "011037_2016", "011037_2016", "0111054_2016"]})
df2 = pd.DataFrame({"col1" : ["011037_2016", "0111054_2016", "011109_2016", "0111268_2016"]})

countif(df1,df2,'col1','col1','df1_in_df2')

print(df1)
#            col1  df1_in_df2
# 0  0110200_2016           0
# 1   011037_2016           1
# 2   011037_2016           1
# 3  0111054_2016           1
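
Note that Excel's COUNTIFS returns a count, while the functions above return a 0/1 flag. If the actual number of matches is needed, one possible sketch is to combine `value_counts` with `map`. The data and the column name `count_in_df2` below are made up for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({"col1": ["a", "b", "c"]})
df2 = pd.DataFrame({"col1": ["b", "b", "c", "d"]})

# Count how often each value occurs in the lookup column.
counts = df2["col1"].value_counts()

# Look up each row's value in that table; values with no match
# come back as NaN and are turned into 0.
df1["count_in_df2"] = df1["col1"].map(counts).fillna(0).astype(int)

print(df1)
```

Here "a" never occurs in df2, "b" occurs twice, and "c" once, so the new column holds 0, 2, and 1.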
