Apply a function with two dataframes as arguments

Question

I'm looking for a way to run a function that takes two dataframes, df1 and df2, as arguments.

I want to create a new column in df1 from the information in df2 without using a loop, because my full df1 has 3M rows and df2 has 700k rows. To do that, I check whether the value of X in df1 falls between from and to in df2.

I tried using apply from the pandas library, but I got errors like:

ValueError: Can only compare identically-labeled Series objects

Here is a sample of my code.

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'X':[1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9,
                         2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9],
                    'Z':['F1','F2','F2','F1','F1','F2','F2','F1','F2','F2',
                         'F1','F1','F1','F1','F1','F1','F1','F1','F1','F1']})

df2 = pd.DataFrame({
           'from': [1.0, 1.5, 1.8, 2.2, 2.6],
           'to': [1.5, 1.8, 2.2, 2.6, 2.9],
           'Z': ['F1', 'F1', 'F2', 'F1', 'F2'],
           'Y': ['foo', 'bar', 'foobar', 'foo', 'zoo']
})
def asign(df1, df2):
    if df1['Z'] == df2['Z']:
        idx = np.where((df1['X'] >= df2['from']) & (df1['X'] <= df2['to']))[0]
        df1['Y'] = df2['Y'][idx]
        return df1

df1.groupby('Z').apply(asign, df2)

The output should look like:

>>> df1
out[0] : 
    X    Z   Y
0   1.0  F1  foo
1   1.1  F2  bar
2   1.2  F2  foobar
3   1.3  F1  foo
4   1.4  F1  foobar
5   1.5  F2  bar
6   1.6  F1  foo
7   1.7  F2  bar

The value of the column Y to be created in df1 depends on which group Z the row belongs to (either F1 or F2) and on the value of X being greater than or equal to from and less than to. Could you please help me with this? Thank you.

Answer 1

Better solution using `pd.cut()`

The old solution below works well, but it might not be very efficient, as it first creates a large dataframe and then selects a subset of rows from it. This solution instead creates bins using pd.cut and then merges the dataframes, directly creating the desired output.

In addition, this gives more flexibility in how the merge is performed.

import pandas as pd

df1 = pd.DataFrame({'X':[1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9,
                         2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9],
                    'Z':['F1','F2','F2','F1','F1','F2','F2','F1','F2','F2',
                         'F1','F1','F1','F1','F1','F1','F1','F1','F1','F1']})

df2 = pd.DataFrame({
           'bmin': [1.0,   1.5,   1.8,      2.2,   2.6],
           'bmax': [1.5,   1.8,   2.2,      2.6,   2.9],
           'Z':    ['F1',  'F1',  'F2',     'F1',  'F2'],
           'Y':    ['foo', 'bar', 'foobar', 'foo', 'zoo']
})


# Adding new column to the dataframes
bins = sorted(df2.bmin.unique()) + [df2.bmax.max()]

df1.loc[:, 'bin'] = pd.cut(
    df1.X,
    bins=bins,
    labels=False,        # Makes cut return int indices for the bins
    include_lowest=True, # Otherwise 1.0 would be NaN
)
df2.loc[:, 'bin'] = pd.cut(
    0.5 * (df2.bmin + df2.bmax),
    bins=bins,
    labels=False,
    include_lowest=True,
)

# Merge on all relevant columns. Change how to 'inner' for an inner join
merged = pd.merge(df1, df2, on=["Z", "bin"], how='outer')

Sample of the output:

      X   Z  bin  bmin  bmax       Y
0   1.0  F1    0   1.0   1.5     foo
1   1.3  F1    0   1.0   1.5     foo
2   1.4  F1    0   1.0   1.5     foo
3   1.1  F2    0   NaN   NaN     NaN
4   1.2  F2    0   NaN   NaN     NaN
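
One detail worth checking against the question: `pd.cut` uses right-closed bins by default (`right=True`), whereas the question asks for X greater than or equal to `from` and less than `to`. The following is a minimal sketch of a variant, not the answer's original code, that passes `right=False` to get half-open `[bmin, bmax)` bins and uses a left merge so that every row of df1 is kept. It assumes, as in the sample data, that the intervals in df2 are contiguous and that each one coincides with exactly one bin; the smaller sample frames and the name `result` are only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    'X': [1.0, 1.1, 1.4, 1.5, 1.7, 1.9],
    'Z': ['F1', 'F2', 'F1', 'F1', 'F1', 'F2'],
})
df2 = pd.DataFrame({
    'bmin': [1.0, 1.5, 1.8, 2.2, 2.6],
    'bmax': [1.5, 1.8, 2.2, 2.6, 2.9],
    'Z': ['F1', 'F1', 'F2', 'F1', 'F2'],
    'Y': ['foo', 'bar', 'foobar', 'foo', 'zoo'],
})

# Bin edges: every lower limit plus the largest upper limit
bins = sorted(df2['bmin'].unique()) + [df2['bmax'].max()]

# right=False makes each bin half-open: [left, right)
df1['bin'] = pd.cut(df1['X'], bins=bins, labels=False, right=False)
# Each interval's lower limit falls in its own bin when right=False
df2['bin'] = pd.cut(df2['bmin'], bins=bins, labels=False, right=False)

# Left merge keeps all rows of df1 in their original order;
# rows without a matching (Z, bin) pair get NaN in Y
result = df1.merge(df2[['Z', 'bin', 'Y']], on=['Z', 'bin'], how='left')
result = result.drop(columns='bin')
print(result)
```

With `right=False`, a value equal to the largest upper limit (2.9 here) falls outside every bin and would get NaN, which matches the "less than to" condition in the question.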

Old solution using `merge` followed by `query`

Perhaps you'd be interested in DataFrame.query()?

In the code below, I use query on a merge of the dataframes on Z. Note that the output of this code differs from the one you wrote, but I don't see how

1   1.1  F2  bar

could result from your input data, since you want both the bin and Z to match. As far as I can see, there is no bin in df2 that contains 1.1 and also has Z=F2. Apologies if I misunderstood your question.

Note that I renamed the bin-limit columns in df2 to bmin and bmax, because from is a Python keyword and Python keywords can't be used as column names in a numexpr query.

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'X':[1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9,
                         2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9],
                    'Z':['F1','F2','F2','F1','F1','F2','F2','F1','F2','F2',
                         'F1','F1','F1','F1','F1','F1','F1','F1','F1','F1']})

df2 = pd.DataFrame({
           'bmin': [1.0,   1.5,   1.8,      2.2,   2.6],
           'bmax': [1.5,   1.8,   2.2,      2.6,   2.9],
           'Z':    ['F1',  'F1',  'F2',     'F1',  'F2'],
           'Y':    ['foo', 'bar', 'foobar', 'foo', 'zoo']
})

merged = pd.merge(
    df1, 
    df2,
    on='Z',
)
merged = merged.query('bmin <= X < bmax')
merged = merged.sort_values(by="X")[['X', 'Z', 'Y']]

This gives the output:

      X   Z       Y
0   1.0  F1     foo
3   1.3  F1     foo
6   1.4  F1     foo
10  1.7  F1     bar
50  1.8  F2  foobar
52  1.9  F2  foobar
20  2.2  F1     foo
23  2.3  F1     foo
26  2.4  F1     foo
29  2.5  F1     foo
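
If df2 still has the original `from` and `to` column names from the question, the renaming mentioned above can be done with `DataFrame.rename`. The sketch below also shows a plain boolean mask, which avoids `query` altogether and so works with the original names. The smaller sample frames and the names `renamed`, `via_query`, and `via_mask` are only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'X': [1.0, 1.1, 1.7, 1.9],
                    'Z': ['F1', 'F2', 'F1', 'F2']})
df2 = pd.DataFrame({
    'from': [1.0, 1.5, 1.8],
    'to':   [1.5, 1.8, 2.2],
    'Z':    ['F1', 'F1', 'F2'],
    'Y':    ['foo', 'bar', 'foobar'],
})

# Option 1: rename the limit columns, then filter with query
renamed = df2.rename(columns={'from': 'bmin', 'to': 'bmax'})
via_query = pd.merge(df1, renamed, on='Z').query('bmin <= X < bmax')

# Option 2: keep the original names and filter with a boolean mask
merged = pd.merge(df1, df2, on='Z')
mask = (merged['from'] <= merged['X']) & (merged['X'] < merged['to'])
via_mask = merged.loc[mask]

print(via_mask[['X', 'Z', 'Y']])
```

Both options still build the full merge on Z before filtering, so the memory concern raised for the old solution applies to them as well.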

Answer 1: accepted, 2021-08-19 11:09:25
