Apply a function with two dataframes as arguments
I'm looking for a way to run a function that takes two dataframes, df1 and df2, as arguments.
What I want is to create a new column in df1 from the information in df2 without using a loop, because my full df1 has 3M rows and df2 has 700k rows. To do that, I check whether the value of X in df1 falls between the from and to values of df2.
I tried using apply from the pandas library, but I got errors like:
ValueError: Can only compare identically-labeled Series objects
Here is a sample of my code.
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'X':[1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9,
2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9],
'Z':['F1','F2','F2','F1','F1','F2','F2','F1','F2','F2',
'F1','F1','F1','F1','F1','F1','F1','F1','F1','F1']})
df2 = pd.DataFrame({
'from': [1.0, 1.5, 1.8, 2.2, 2.6],
'to': [1.5, 1.8, 2.2, 2.6, 2.9],
'Z': ['F1', 'F1', 'F2', 'F1', 'F2'],
'Y': ['foo', 'bar', 'foobar', 'foo', 'zoo']
})
def asign(df1, df2):
    if df1['Z'] == df2['Z']:  # this comparison raises the ValueError quoted above
        idx = np.where((df1['X'] >= df2['from']) & (df1['X'] <= df2['to']))[0]
        df1['Y'] = df2['Y'][idx]
    return df1

df1.groupby('Z').apply(asign, df2)
The output should look like:
>>> df1
out[0] :
X Z Y
0 1.0 F1 foo
1 1.1 F2 bar
2 1.2 F2 foobar
3 1.3 F1 foo
4 1.4 F1 foobar
5 1.5 F2 bar
6 1.6 F1 foo
7 1.7 F2 bar
The value of the column Y to be created in df1 depends on the row's group Z (either F1 or F2) and on the value of X being greater than or equal to from and less than to. Could you please help me with this? Thank you.
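For what it's worth, the error quoted in the question seems to come from the == between the two Z columns rather than from apply itself: pandas refuses to compare two Series whose index labels differ. Below is a minimal sketch that reproduces it in isolation; the toy Series group_z and bins_z are made up for illustration and are not part of the original code.

```python
import pandas as pd

# Two toy Series with different lengths, hence different index labels
# (hypothetical stand-ins for one group's Z column and df2's Z column).
group_z = pd.Series(['F1', 'F1', 'F1'])
bins_z = pd.Series(['F1', 'F1', 'F2', 'F1', 'F2'])

caught = None
try:
    group_z == bins_z  # element-wise comparison needs identical labels
except ValueError as err:
    caught = err

print(type(caught).__name__, caught)
```

Even with matching labels, using the resulting boolean Series directly in an if statement would fail, since a Series has no single truth value.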
Better solution using pd.cut()
The old solution below works well, but it might not be very efficient, as it first creates a large dataframe and then selects a subset of rows from it. This solution instead creates bins using pd.cut and then merges the dataframes, directly creating the desired output.
This also gives additional flexibility in how the merge is done.
import pandas as pd

df1 = pd.DataFrame({'X':[1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9,
2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9],
'Z':['F1','F2','F2','F1','F1','F2','F2','F1','F2','F2',
'F1','F1','F1','F1','F1','F1','F1','F1','F1','F1']})
df2 = pd.DataFrame({
'bmin': [1.0, 1.5, 1.8, 2.2, 2.6],
'bmax': [1.5, 1.8, 2.2, 2.6, 2.9],
'Z': ['F1', 'F1', 'F2', 'F1', 'F2'],
'Y': ['foo', 'bar', 'foobar', 'foo', 'zoo']
})
# Bin edges: every distinct lower bound plus the largest upper bound
bins = sorted(df2.bmin.unique()) + [df2.bmax.max()]

# Add a 'bin' column to both dataframes
df1['bin'] = pd.cut(
    df1.X,
    bins=bins,
    labels=False,         # Makes cut return int indices for the bins
    include_lowest=True,  # Otherwise 1.0 would be NaN
)
# Use each interval's midpoint so that it lands inside its own bin
df2['bin'] = pd.cut(
    0.5 * (df2.bmin + df2.bmax),
    bins=bins,
    labels=False,
    include_lowest=True,
)

# Merge on all relevant columns. Change how to 'inner' for an inner join
merged = pd.merge(df1, df2, on=["Z", "bin"], how='outer')
Sample of the output:
X Z bin bmin bmax Y
0 1.0 F1 0 1.0 1.5 foo
1 1.3 F1 0 1.0 1.5 foo
2 1.4 F1 0 1.0 1.5 foo
3 1.1 F2 0 NaN NaN NaN
4 1.2 F2 0 NaN NaN NaN
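One thing to be aware of: by default pd.cut builds right-closed bins, so a value sitting exactly on an edge (such as 1.5) goes into the lower bin, whereas the question asks for from <= X < to. Passing right=False should give left-closed bins instead. The sketch below compares the two on a few sample values; the names edges, closed_right and closed_left are only for illustration.

```python
import pandas as pd

# A few X values, including ones that sit exactly on a bin edge
x = pd.Series([1.0, 1.4, 1.5, 1.7, 2.9])
edges = [1.0, 1.5, 1.8, 2.2, 2.6, 2.9]

# Default behaviour: bins are (a, b], plus the lowest edge via include_lowest
closed_right = pd.cut(x, bins=edges, labels=False, include_lowest=True)

# Left-closed bins [a, b): the last edge (2.9) is no longer covered
closed_left = pd.cut(x, bins=edges, labels=False, right=False)

print(pd.DataFrame({'X': x, 'right_closed': closed_right, 'left_closed': closed_left}))
```

With right=False the value 1.5 moves to the second bin, and 2.9 falls outside every bin (NaN), which is consistent with a strict X < to condition.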
Old solution using merge followed by query
Perhaps you'd be interested in DataFrame.query()?
In the code below, I use query on a merge of the dataframes on Z. Note that the output of this code differs from the one you wrote, but I don't see how
1 1.1 F2 bar
could result from your input data, since you want both the bin and Z to match. As far as I can see, there is no bin in df2 that contains 1.1 and also has Z=F2. Apologies if I misunderstood your question.
Note that I renamed the bin-limit columns in df2 because Python keywords cannot be used in a numexpr query (from is a keyword).
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'X':[1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9,
2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9],
'Z':['F1','F2','F2','F1','F1','F2','F2','F1','F2','F2',
'F1','F1','F1','F1','F1','F1','F1','F1','F1','F1']})
df2 = pd.DataFrame({
'bmin': [1.0, 1.5, 1.8, 2.2, 2.6],
'bmax': [1.5, 1.8, 2.2, 2.6, 2.9],
'Z': ['F1', 'F1', 'F2', 'F1', 'F2'],
'Y': ['foo', 'bar', 'foobar', 'foo', 'zoo']
})
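# Pair every df1 row with every df2 bin that shares the same Z
merged = pd.merge(
    df1,
    df2,
    on='Z',
)
# Keep only the pairs where X falls inside the half-open bin [bmin, bmax)
merged = merged.query('bmin <= X < bmax')
merged = merged.sort_values(by="X")[['X', 'Z', 'Y']]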
Gives the output:
X Z Y
0 1.0 F1 foo
3 1.3 F1 foo
6 1.4 F1 foo
10 1.7 F1 bar
50 1.8 F2 foobar
52 1.9 F2 foobar
20 2.2 F1 foo
23 2.3 F1 foo
26 2.4 F1 foo
29 2.5 F1 foo
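Since the merge on Z pairs every row with every bin of the same group, the intermediate frame can get very large at 3M x 700k rows. A possible alternative, not part of the original answer, is pd.merge_asof with by='Z', which looks up the closest preceding bmin per row without building that intermediate. The following is a rough sketch on a shortened df1, assuming the bins within each Z group do not overlap; left, right, asof and result are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({
    'X': [1.0, 1.1, 1.7, 1.9, 2.0, 2.3],
    'Z': ['F1', 'F2', 'F1', 'F2', 'F1', 'F1'],
})
df2 = pd.DataFrame({
    'bmin': [1.0, 1.5, 1.8, 2.2, 2.6],
    'bmax': [1.5, 1.8, 2.2, 2.6, 2.9],
    'Z': ['F1', 'F1', 'F2', 'F1', 'F2'],
    'Y': ['foo', 'bar', 'foobar', 'foo', 'zoo'],
})

# merge_asof requires both frames to be sorted by their join key
left = df1.sort_values('X')
right = df2.sort_values('bmin')

# For each row, take the last bin of the same Z whose bmin <= X
asof = pd.merge_asof(left, right, left_on='X', right_on='bmin',
                     by='Z', direction='backward')

# Discard matches where X lies at or beyond the bin's upper bound
asof['Y'] = asof['Y'].where(asof['X'] < asof['bmax'])
result = asof[['X', 'Z', 'Y']]
print(result)
```

Unlike the query approach, this keeps every row of df1 and leaves Y missing where no bin matches, which may be closer to adding a column to df1 in place.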