Pandas: create two new columns in a dataframe with values calculated from a pre-existing column

Question

I am working with the pandas library and I want to add two new columns to a dataframe df with n columns (n > 0).
These new columns result from applying a function to one of the columns in the dataframe.

The function to apply looks like this:

def calculate(x):
    ...operate...
    return z, y

One way to create a new column from a function that returns a single value is:

df['new_col'] = df['column_A'].map(a_function)

So what I want, and tried unsuccessfully (*), is something like:

(df['new_col_zetas'], df['new_col_ys']) = df['column_A'].map(calculate)

What would be the best way to accomplish this? I scanned the documentation without finding a clue.

(*) df['column_A'].map(calculate) returns a pandas Series in which each item is a tuple (z, y). Trying to assign this to two dataframe columns produces a ValueError.
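For readers who want to reproduce the failure, here is a minimal sketch. The body of calculate is a hypothetical stand-in (the question elides it), and the three-row frame is made up for illustration. With three rows, Python tries to unpack three tuples into two targets, which is where the ValueError comes from.

```python
import pandas as pd


def calculate(x):
    # Hypothetical stand-in for the elided function body in the question.
    return x * 2, x * 3


df = pd.DataFrame({"column_A": [1, 2, 3]})

# One (z, y) tuple per row, stored in an object-dtype Series.
pairs = df["column_A"].map(calculate)

error = None
try:
    # Unpacking iterates over the rows: 3 tuples, but only 2 targets.
    df["new_col_zetas"], df["new_col_ys"] = pairs
except ValueError as exc:
    error = exc

print(type(error).__name__, error)
```

Note that a frame with exactly two rows would not raise at all; it would silently assign one row's tuple to each column, which is arguably worse.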

Answer 1

I'd just use zip:

In [1]: from pandas import *

In [2]: def calculate(x):
   ...:     return x*2, x*3
   ...: 

In [3]: df = DataFrame({'a': [1,2,3], 'b': [2,3,4]})

In [4]: df
Out[4]: 
   a  b
0  1  2
1  2  3
2  3  4

In [5]: df["A1"], df["A2"] = zip(*df["a"].map(calculate))

In [6]: df
Out[6]: 
   a  b  A1  A2
0  1  2   2   3
1  2  3   4   6
2  3  4   6   9
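To make the zip(*...) step less opaque, the sketch below splits it into named pieces, using the same calculate and data as the session above. The intermediate names (pairs, doubled, tripled, expanded) are introduced here for illustration only. The last line shows one alternative, building a DataFrame from the list of tuples, which is a different technique from the zip approach in this answer.

```python
import pandas as pd


def calculate(x):
    return x * 2, x * 3


df = pd.DataFrame({"a": [1, 2, 3], "b": [2, 3, 4]})

# One (x*2, x*3) tuple per row.
pairs = df["a"].map(calculate)

# zip(*pairs) transposes rows of tuples into one tuple per output column.
doubled, tripled = zip(*pairs)

df["A1"] = doubled
df["A2"] = tripled

# Alternative: expand the tuples into a two-column DataFrame in one step.
expanded = pd.DataFrame(pairs.tolist(), index=df.index, columns=["A1", "A2"])
```

Both routes produce the same two columns; the zip form is shorter, while the DataFrame form keeps the index explicit.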

Answer 2

The top answer is flawed in my opinion. Hopefully, no one is mass importing all of pandas into their namespace with from pandas import *. Also, the map method should be reserved for those times when you pass it a dictionary or Series. It can take a function, but this is what apply is for.

So, if you must use the above approach, I would write it like this:

df["A1"], df["A2"] = zip(*df["a"].apply(calculate))

There's actually no reason to use zip here. You can simply do this:

df["A1"], df["A2"] = calculate(df['a'])
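Calling the function on the whole column works here because calculate only uses arithmetic that a Series supports element-wise. As a rough sketch of where that stops working, calculate_branching below is a hypothetical function (not from the question) that branches on its argument, so it cannot be handed a Series directly.

```python
import pandas as pd


def calculate(x):
    return x * 2, x * 3


def calculate_branching(x):
    # Hypothetical scalar-only function: `if` needs a single True/False.
    if x > 1:
        return x * 2, x * 3
    return 0, 0


df = pd.DataFrame({"a": [1, 2, 3]})

# Works: calculate returns a tuple of two Series, one per new column.
df["A1"], df["A2"] = calculate(df["a"])

branching_error = None
try:
    calculate_branching(df["a"])
except ValueError as exc:
    # The truth value of a whole Series is ambiguous.
    branching_error = exc
```

For a function like calculate_branching, the options are to rewrite it with vectorized operations or to fall back to a per-element approach.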

This second method is also much faster on larger DataFrames:

df = pd.DataFrame({'a': [1,2,3] * 100000, 'b': [2,3,4] * 100000})

DataFrame created with 300,000 rows:

%timeit df["A1"], df["A2"] = calculate(df['a'])
2.65 ms ± 92.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df["A1"], df["A2"] = zip(*df["a"].apply(calculate))
159 ms ± 5.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

60x faster than zip

In general, avoid using apply

Apply is generally not much faster than iterating over a Python list. Let's test the performance of a for-loop doing a comparable computation (note that it uses powers rather than the multiplications in calculate):

%%timeit
A1, A2 = [], []
for val in df['a']:
    A1.append(val**2)
    A2.append(val**3)

df['A1'] = A1
df['A2'] = A2

298 ms ± 7.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So this is about twice as slow, which isn't a terrible performance regression, but if we cythonize the above, we get much better performance. Assuming you are using IPython:

%load_ext cython

%%cython
cpdef power(vals):
    A1, A2 = [], []
    cdef double val
    for val in vals:
        A1.append(val**2)
        A2.append(val**3)

    return A1, A2

%timeit df['A1'], df['A2'] = power(df['a'])
72.7 ms ± 2.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Directly assigning without apply

You can get even greater speed improvements if you use the direct vectorized operations.

%timeit df['A1'], df['A2'] = df['a'] ** 2, df['a'] ** 3
5.13 ms ± 320 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This takes advantage of NumPy's extremely fast vectorized operations instead of our loops. We now have a 30x speedup over the original.
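As a sanity check that the loop and the vectorized form compute the same columns, here is a small sketch on a shorter frame. The frame size and the names looped and vectorized are chosen for illustration, and DataFrame.assign is used only as a convenient way to add both columns at once.

```python
import pandas as pd

# Smaller than the 300,000-row frame above so the check runs quickly.
df = pd.DataFrame({"a": [1, 2, 3] * 1000})

# Explicit Python loop, as in the for-loop timing above.
A1, A2 = [], []
for val in df["a"]:
    A1.append(val ** 2)
    A2.append(val ** 3)
looped = df.assign(A1=A1, A2=A2)

# Vectorized: the power operation is applied to the whole column at once.
vectorized = df.assign(A1=df["a"] ** 2, A2=df["a"] ** 3)

print(looped.equals(vectorized))
```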

The simplest speed test with `apply`

The above example should clearly show how slow apply can be, but just so it's extra clear, let's look at the most basic example. Let's square a Series of 10 million numbers with and without apply:

# calc is not shown in the original; assumed to be: def calc(x): return x ** 2
s = pd.Series(np.random.rand(10000000))

%timeit s.apply(calc)
3.3 s ± 57.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Without apply is 50x faster:

%timeit s ** 2
66 ms ± 2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
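Outside IPython, a comparison along these lines can be run with the standard timeit module. This is only a sketch: calc is assumed to square its argument (its definition is not shown above), and the Series is shrunk to 100,000 values so the script finishes quickly. Absolute timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def calc(x):
    # Assumed definition: the answer does not show calc.
    return x ** 2


# Smaller than 10 million values to keep the run short.
s = pd.Series(np.random.rand(100_000))

with_apply = s.apply(calc)   # calls calc once per element
vectorized = s ** 2          # one operation over the whole array

t_apply = timeit.timeit(lambda: s.apply(calc), number=5)
t_vector = timeit.timeit(lambda: s ** 2, number=5)

print(f"apply: {t_apply:.4f}s  vectorized: {t_vector:.4f}s")
```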

2 answers

Answer 1: 125, accepted, 2012-09-10 17:20:49

Answer 2: 49, 2017-11-03 18:08:47
