Finding a specific value for each row in a Dataframe from another Dataframe's column
I am looking for ways to replace functions I use in Excel with Python, especially with Pandas. One of these functions is COUNTIFS(), which I have mainly been using to look up specific row values in a fixed range, that is, to determine whether the values in one column are present in another column.
An example in Excel would look something like this:
The code for the first row (column: col1_in_col2):
=COUNTIFS($B$2:$B$6,A2)
I have tried to recreate the function in Pandas, with the difference that the two columns are in two different DataFrames and the DataFrames are stored inside a dictionary (bigdict). The code is the following:
import pandas as pd
bigdict = {"df1": pd.DataFrame({"col1": ["0110200_2016", "011037_2016", "011037_2016", "0111054_2016"]}), "df2": pd.DataFrame({"col1" : ["011037_2016", "0111054_2016", "011109_2016", "0111268_2016"]})}
bigdict.get("df1")["df1_in_df2"] = bigdict.get("df1").apply(lambda x: 1 if x["col1"] in bigdict.get("df2")["col1"] else 0, axis=1)
In the example above, the first row should get a return value of 0, while the other rows should get 1, since their values can be found in the other DataFrame's column. However, the return value is 0 for every row.
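A likely reason every row comes back 0 is that the `in` operator on a pandas Series tests membership against the index labels, not the stored values. A minimal sketch using the second column from the question (the variable names here are only for illustration):

```python
import pandas as pd

s = pd.Series(["011037_2016", "0111054_2016", "011109_2016", "0111268_2016"])

# `in` on a Series looks at the index labels (0, 1, 2, 3), not the strings
in_index = "011037_2016" in s
label_hit = 0 in s

# Testing against the underlying values gives the intended answer
in_values = "011037_2016" in s.values
in_isin = bool(s.isin(["011037_2016"]).any())

print(in_index, label_hit, in_values, in_isin)
```

So the lambda in the question compares each string against the row labels of df2, which never match.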
Try this. I unpacked your dictionary into two DataFrames and compared their values.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({"col1": ["0110200_2016", "011037_2016", "011037_2016", "0111054_2016"]})
df2 = pd.DataFrame({"col1" : ["011037_2016", "0111054_2016", "011109_2016", "0111268_2016"]})
df1['df1_in_df2'] = np.where(df1.iloc[:,0].isin(list(df2.iloc[:,0])),1,0)
Here is a way to do it using a list comprehension:
bigdict['df1']['df1_in_df2'] = [1 if x in bigdict['df2']['col1'].values else 0
for x in bigdict['df1']['col1']]
Output:
col1 df1_in_df2
0 0110200_2016 0
1 011037_2016 1
2 011037_2016 1
3 0111054_2016 1
This is basically the same as @Ashwini's answer, but it gets rid of np.where and iloc, which could make it more readable and possibly faster.
import pandas as pd
df1 = pd.DataFrame({"col1": ["0110200_2016", "011037_2016",
"011037_2016", "0111054_2016"]})
df2 = pd.DataFrame({"col1" : ["011037_2016", "0111054_2016",
"011109_2016", "0111268_2016"]})
df1['df1_in_df2'] = df1["col1"].isin(df2['col1'].values).astype("int8")
UPDATE
Here I compare four methods: @vlemaistre's, @Ashwini's, @SamLegesse's and mine.
import pandas as pd
import numpy as np
# create fake data
n = int(1e6)
n1 = int(1e4)
df = pd.DataFrame()
df["col1"] = ["{:012}".format(i) for i in range(n)]
df2 = df.sample(n1)
toRemove = df2.sample(n1//2).index
df1 = df[~df.index.isin(toRemove)].sample(frac=1).reset_index(drop=True)
df2 = df2.reset_index(drop=True)
# backup dataframe
df0 = df1.copy()
bigdict = {"df1": df1, "df2": df2}
%%time
bigdict['df1']['df1_in_df2'] = [1 if x in bigdict['df2']['col1'].values else 0
for x in bigdict['df1']['col1']]
CPU times: user 4min 53s, sys: 3.08 s, total: 4min 56s
Wall time: 4min 41s
def countif(x, col):
    # Return 1 if x occurs among the values of col, else 0
    if x in col.values:
        return 1
    else:
        return 0
df1 = df0.copy()
%%time
df1['df1_in_df2'] = df1['col1'].apply(countif, col=df2['col1'])
CPU times: user 4min 48s, sys: 2.66 s, total: 4min 50s
Wall time: 4min 38s
df1 = df0.copy()
%%time
df1['df1_in_df2'] = np.where(df1.iloc[:,0].isin(list(df2.iloc[:,0])),1,0)
CPU times: user 167 ms, sys: 0 ns, total: 167 ms
Wall time: 165 ms
The following is on par with Ashwini's solution:
df1 = df0.copy()
%%time
df1['df1_in_df2'] = df1["col1"].isin(df2['col1'].values).astype("int8")
CPU times: user 152 ms, sys: 0 ns, total: 152 ms
Wall time: 150 ms
The vectorized methods are (at least) 1684x faster than the one using apply.
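Note that `%%time` is a Jupyter/IPython cell magic, so the cells above will not run as a plain script. A rough sketch of the same kind of comparison with the standard-library `timeit` module follows; the sizes `n` and `n1` are arbitrary assumptions, much smaller than in the benchmark above, and the absolute timings will differ by machine:

```python
import timeit

import pandas as pd

# Small synthetic data; the sizes are arbitrary and far smaller than above
n, n1 = 2000, 200
df1 = pd.DataFrame({"col1": ["{:012}".format(i) for i in range(n)]})
df2 = df1.sample(n1, random_state=0).reset_index(drop=True)


def with_loop():
    # Python-level membership test: one scan of df2's values per row of df1
    values = df2["col1"].values
    return [1 if x in values else 0 for x in df1["col1"]]


def with_isin():
    # Vectorized membership test
    return df1["col1"].isin(df2["col1"]).astype("int8")


# Both approaches should flag exactly the same rows
same_result = list(with_isin()) == with_loop()

t_loop = timeit.timeit(with_loop, number=3)
t_isin = timeit.timeit(with_isin, number=3)
print(f"same result: {same_result}, loop: {t_loop:.4f}s, isin: {t_isin:.4f}s")
```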
The easiest way, in my opinion, is to write a generic function that you can apply any time you want the equivalent of an Excel COUNTIF().
import pandas as pd
def countif(x, col):
    # Return 1 if x occurs among the values of col, else 0
    if x in col.values:
        return 1
    else:
        return 0
df1 = pd.DataFrame({"col1": ["0110200_2016", "011037_2016", "011037_2016", "0111054_2016"]})
df2 = pd.DataFrame({"col1" : ["011037_2016", "0111054_2016", "011109_2016", "0111268_2016"]})
df1['df1_in_df2'] = df1['col1'].apply(countif, col=df2['col1'])
EDIT:
As rpanai mentioned in the comments, apply is known to have performance issues as your data grows. Using numpy vectorization provides a large performance boost. Here is a modified version of Ashwini's answer.
import pandas as pd
import numpy as np
def countif(df1, df2, col1, col2, name):
    # Add a 0/1 column `name` to df1 in place: 1 where df1[col1] occurs in df2[col2]
    df1[name] = np.where(df1[col1].isin(list(df2[col2])), 1, 0)
df1 = pd.DataFrame({"col1": ["0110200_2016", "011037_2016", "011037_2016", "0111054_2016"]})
df2 = pd.DataFrame({"col1" : ["011037_2016", "0111054_2016", "011109_2016", "0111268_2016"]})
countif(df1,df2,'col1','col1','df1_in_df2')
print(df1)
# col1 df1_in_df2
# 0 0110200_2016 0
# 1 011037_2016 1
# 2 011037_2016 1
# 3 0111054_2016 1
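All of the answers here return a 0/1 flag, whereas Excel's COUNTIFS returns the number of matches. If the actual count is wanted, one possible sketch (not taken from the answers above) looks up each value in the `value_counts()` of the other column. The toy data and the column name `count_in_df2` are made up for illustration:

```python
import pandas as pd

# Toy data, chosen so that one value occurs twice in df2
df1 = pd.DataFrame({"col1": ["a", "b", "b", "c"]})
df2 = pd.DataFrame({"col1": ["b", "c", "c", "d"]})

# Number of times each distinct value occurs in df2["col1"]
counts = df2["col1"].value_counts()

# Look up every df1 value in that table; values absent from df2 become 0
df1["count_in_df2"] = df1["col1"].map(counts).fillna(0).astype(int)

print(df1)
```

Comparing the resulting column with `> 0` would give back the same kind of flag the other answers produce.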