
Pandas: create two new columns in a dataframe with values calculated from a pre-existing column

I am working with the pandas library and I want to add two new columns to a dataframe df with n columns (n > 0).
These new columns result from applying a function to one of the columns in the dataframe.

The function to apply looks like this:

def calculate(x):
    ...operate...
    return z, y

One method for creating a new column from a function that returns a single value is:

df['new_col'] = df['column_A'].map(a_function)

So what I want, and tried unsuccessfully (*), is something like:

(df['new_col_zetas'], df['new_col_ys']) = df['column_A'].map(calculate)

What would be the best way to accomplish this? I scanned the documentation and found no clue.

(*) df['column_A'].map(calculate) returns a pandas Series in which each item is a tuple (z, y). Trying to assign this to two dataframe columns produces a ValueError.
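The failure described in the footnote can be reproduced with a small sketch. The body of calculate here is a hypothetical stand-in (the question elides it), and the three-row frame is made up for illustration.

```python
import pandas as pd

def calculate(x):
    # Hypothetical stand-in for the question's elided function body
    return x * 2, x * 3

df = pd.DataFrame({"column_A": [1, 2, 3]})

# map calls calculate once per element, giving a Series of (z, y) tuples
pairs = df["column_A"].map(calculate)

try:
    # Python unpacks the Series itself: three items, but only two targets
    df["new_col_zetas"], df["new_col_ys"] = pairs
    unpack_error = None
except ValueError as exc:
    unpack_error = exc

print(pairs.tolist())
print(type(unpack_error).__name__)
```

The error comes from plain Python tuple unpacking rather than from pandas: the right-hand side yields one item per row, not one item per new column.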

I'd just use zip:

In [1]: from pandas import *

In [2]: def calculate(x):
   ...:     return x*2, x*3
   ...: 

In [3]: df = DataFrame({'a': [1,2,3], 'b': [2,3,4]})

In [4]: df
Out[4]: 
   a  b
0  1  2
1  2  3
2  3  4

In [5]: df["A1"], df["A2"] = zip(*df["a"].map(calculate))

In [6]: df
Out[6]: 
   a  b  A1  A2
0  1  2   2   3
1  2  3   4   6
2  3  4   6   9
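To see why this works: zip(*pairs) transposes a sequence of (z, y) pairs into one tuple holding every z and one holding every y, which then line up with the two column targets. A minimal sketch of that step, using an explicit import pandas as pd rather than the star import and the same toy calculate:

```python
import pandas as pd

def calculate(x):
    return x * 2, x * 3

df = pd.DataFrame({"a": [1, 2, 3], "b": [2, 3, 4]})

pairs = df["a"].map(calculate)   # Series of tuples, one (z, y) per row
zs, ys = zip(*pairs)             # transpose: all first items, all second items

# Each tuple has one value per row, so it can be assigned as a column
df["A1"], df["A2"] = zs, ys
print(df)
```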

The top answer is flawed in my opinion. Hopefully, no one is mass importing all of pandas into their namespace with from pandas import *. Also, the map method should be reserved for the times when you pass it a dictionary or Series. It can take a function, but that is what apply is for.

So, if you must use the above approach, I would write it like this:

df["A1"], df["A2"] = zip(*df["a"].apply(calculate))

There's actually no reason to use zip here, since calculate only uses arithmetic that also works on a whole Series. You can simply do this:

df["A1"], df["A2"] = calculate(df['a'])

This second method is also much faster on larger DataFrames:

import pandas as pd
df = pd.DataFrame({'a': [1,2,3] * 100000, 'b': [2,3,4] * 100000})

The DataFrame was created with 300,000 rows.

%timeit df["A1"], df["A2"] = calculate(df['a'])
2.65 ms ± 92.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df["A1"], df["A2"] = zip(*df["a"].apply(calculate))
159 ms ± 5.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

That is 60x faster than zip.
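Note that the direct call only works because this calculate is built from operations that pandas applies element-wise to a whole Series. A function that branches on its argument cannot be called this way. The sketch below illustrates both cases; needs_scalar is a hypothetical function invented for the example.

```python
import pandas as pd

def calculate(x):
    # Pure arithmetic: works on a scalar and on a whole Series alike
    return x * 2, x * 3

def needs_scalar(x):
    # Hypothetical function: the `if` needs a single True/False,
    # which a whole Series cannot provide
    if x > 1:
        return x * 2, x * 3
    return 0, 0

df = pd.DataFrame({"a": [1, 2, 3]})

# Vectorized call: calculate returns two Series, one per new column
df["A1"], df["A2"] = calculate(df["a"])

try:
    needs_scalar(df["a"])
    direct_call_error = None
except ValueError as exc:
    # pandas refuses to reduce a whole Series to one boolean
    direct_call_error = exc

# Element-wise fallback for the scalar-only function
df["B1"], df["B2"] = zip(*df["a"].map(needs_scalar))
print(df)
```

For functions like needs_scalar, the element-wise route (or a rewrite in terms of vectorized operations) is still needed.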


In general, avoid using apply

Apply is generally not much faster than iterating over a Python list. Let's test the performance of a for-loop doing comparable work (it computes powers rather than the multiples used above):

%%timeit
A1, A2 = [], []
for val in df['a']:
    A1.append(val**2)
    A2.append(val**3)

df['A1'] = A1
df['A2'] = A2

298 ms ± 7.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So the loop is about twice as slow, which isn't a terrible performance regression, but if we cythonize it, we get much better performance. Assuming you are using IPython:

%load_ext cython

%%cython
cpdef power(vals):
    A1, A2 = [], []
    cdef double val
    for val in vals:
        A1.append(val**2)
        A2.append(val**3)

    return A1, A2

%timeit df['A1'], df['A2'] = power(df['a'])
72.7 ms ± 2.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Directly assigning without apply

You can get even greater speed improvements if you use direct vectorized operations.

%timeit df['A1'], df['A2'] = df['a'] ** 2, df['a'] ** 3
5.13 ms ± 320 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This takes advantage of NumPy's extremely fast vectorized operations instead of our loops. We now have a 30x speedup over the original.
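The same vectorized expressions can also be passed to DataFrame.assign, which returns a new frame instead of modifying df in place. This is an alternative not mentioned in the answer, shown here on a made-up three-row frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [2, 3, 4]})

# Each keyword becomes a new column; the original df is left unchanged
out = df.assign(A1=df["a"] ** 2, A2=df["a"] ** 3)
print(out)
```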


The simplest speed test with apply

The above example should clearly show how slow apply can be, but just so it's extra clear, let's look at the most basic example. Let's square a Series of 10 million numbers with and without apply:

import numpy as np
s = pd.Series(np.random.rand(10000000))

# calc is not defined in the post; it presumably squares its argument (x ** 2)
%timeit s.apply(calc)
3.3 s ± 57.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Without apply, it is 50x faster:

%timeit s ** 2
66 ms ± 2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
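Outside IPython, a comparable check can be sketched with the standard timeit module. The calc function below is an assumption (the post does not show its definition), and the Series is smaller than in the post so the script finishes quickly. Absolute timings will differ by machine.

```python
import timeit

import numpy as np
import pandas as pd

def calc(x):
    # Assumed definition: square a single value
    return x ** 2

s = pd.Series(np.random.rand(100_000))

applied = s.apply(calc)   # calls calc once per element
vectorized = s ** 2       # one vectorized operation on the whole Series

# Total seconds for three runs of each approach
t_apply = timeit.timeit(lambda: s.apply(calc), number=3)
t_vector = timeit.timeit(lambda: s ** 2, number=3)
print(f"apply: {t_apply:.4f}s  vectorized: {t_vector:.4f}s")
```

Both approaches produce the same values; only the time taken differs.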

