Pandas: create two new columns in a dataframe with values calculated from a pre-existing column
I am working with the pandas library and I want to add two new columns to a dataframe df with n columns (n > 0).
These new columns result from the application of a function to one of the columns in the dataframe.
The function to apply is like:
def calculate(x):
    ...operate...
    return z, y
One method for creating a new column from a function that returns a single value is:
df['new_col'] = df['column_A'].map(a_function)
So what I want, and tried unsuccessfully (*), is something like:
(df['new_col_zetas'], df['new_col_ys']) = df['column_A'].map(calculate)
What would be the best way to accomplish this? I scanned the documentation and found no clue.
(*) df['column_A'].map(calculate) returns a pandas Series, each item of which is a tuple (z, y). Trying to assign this to two dataframe columns produces a ValueError.
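For readers who want to reproduce the failure, here is a minimal sketch. The body of calculate and the three-row frame are assumptions made for illustration, since the original function is not shown.

```python
import pandas as pd

def calculate(x):
    # Hypothetical stand-in for the real function: returns a (z, y) tuple.
    return x * 2, x * 3

df = pd.DataFrame({"column_A": [1, 2, 3]})

# map() produces ONE Series whose items are tuples, not two Series.
result = df["column_A"].map(calculate)

try:
    # Unpacking iterates over the three rows, but there are only two targets,
    # so Python raises before anything is assigned.
    df["new_col_zetas"], df["new_col_ys"] = result
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(result.tolist())
print(error_message)
```

The error comes from plain Python tuple unpacking over the rows of the Series, not from pandas itself.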
I'd just use zip:
In [1]: from pandas import *

In [2]: def calculate(x):
   ...:     return x*2, x*3
   ...:

In [3]: df = DataFrame({'a': [1,2,3], 'b': [2,3,4]})

In [4]: df
Out[4]:
   a  b
0  1  2
1  2  3
2  3  4

In [5]: df["A1"], df["A2"] = zip(*df["a"].map(calculate))

In [6]: df
Out[6]:
   a  b  A1  A2
0  1  2   2   3
1  2  3   4   6
2  3  4   6   9
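To see what zip(*...) is doing, the one-liner can be split into steps. This is only a sketch of the same idea with an explicit import; the intermediate names pairs, zs and ys are introduced here for illustration.

```python
import pandas as pd

def calculate(x):
    return x * 2, x * 3

df = pd.DataFrame({"a": [1, 2, 3], "b": [2, 3, 4]})

# Step 1: a Series of (z, y) tuples, one tuple per row.
pairs = df["a"].map(calculate)

# Step 2: zip(*pairs) transposes the rows of tuples into two tuples,
# one holding every z and one holding every y.
zs, ys = zip(*pairs)

# Step 3: each tuple has one entry per row, so it can become a column.
df["A1"], df["A2"] = zs, ys
print(df)
```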
The top answer is flawed in my opinion. Hopefully, no one is mass importing all of pandas into their namespace with from pandas import *. Also, the map method should be reserved for those times when you pass it a dictionary or a Series. It can take a function, but this is what apply is used for.
So, if you must use the above approach, I would write it like this:
df["A1"], df["A2"] = zip(*df["a"].apply(calculate))
There's actually no reason to use zip here. Because calculate only uses arithmetic that works on a whole Series at once, you can simply pass it the column:
df["A1"], df["A2"] = calculate(df['a'])
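A short sketch of why the direct call works, and when it does not. The function needs_scalar is a hypothetical example added here; it is not part of the original question.

```python
import pandas as pd

def calculate(x):
    return x * 2, x * 3

df = pd.DataFrame({"a": [1, 2, 3]})

# Called with a whole Series, calculate returns a tuple of two Series,
# which unpacks cleanly into two columns.
z, y = calculate(df["a"])
df["A1"], df["A2"] = z, y

def needs_scalar(x):
    # Hypothetical function that branches on a single value, so it
    # cannot be handed a whole Series.
    if x > 1:
        return x * 2, x * 3
    return 0, 0

try:
    needs_scalar(df["a"])
    scalar_only_failed = False
except ValueError:
    # "The truth value of a Series is ambiguous"
    scalar_only_failed = True

# For such a function, the element-wise zip approach still applies.
df["B1"], df["B2"] = zip(*df["a"].map(needs_scalar))
print(df)
```

So the direct call is an option only when every operation inside the function is itself vectorized.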
This second method is also much faster on larger DataFrames:
import pandas as pd
df = pd.DataFrame({'a': [1,2,3] * 100000, 'b': [2,3,4] * 100000})
This creates a DataFrame with 300,000 rows.
%timeit df["A1"], df["A2"] = calculate(df['a'])
2.65 ms ± 92.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df["A1"], df["A2"] = zip(*df["a"].apply(calculate))
159 ms ± 5.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
That is 60x faster than the zip version.
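Outside IPython, where %timeit is not available, a rough version of the same comparison can be sketched with the standard-library timeit module. The helper names direct and with_zip are made up for this example, and the numbers will differ from machine to machine.

```python
import timeit

import pandas as pd

def calculate(x):
    return x * 2, x * 3

df = pd.DataFrame({"a": [1, 2, 3] * 100000, "b": [2, 3, 4] * 100000})

def direct():
    # One vectorized call on the whole column.
    df["A1"], df["A2"] = calculate(df["a"])

def with_zip():
    # One Python-level call per row, then a transpose with zip.
    df["A1"], df["A2"] = zip(*df["a"].apply(calculate))

runs = 3
t_direct = timeit.timeit(direct, number=runs) / runs
t_zip = timeit.timeit(with_zip, number=runs) / runs

print(f"direct:    {t_direct * 1e3:.2f} ms per run")
print(f"zip/apply: {t_zip * 1e3:.2f} ms per run")
```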
Apply is generally not much faster than iterating over a Python list. Let's test the performance of a for-loop doing the same kind of per-element work (note that this loop computes powers rather than the multiples used above):
%%timeit
A1, A2 = [], []
for val in df['a']:
    A1.append(val**2)
    A2.append(val**3)
df['A1'] = A1
df['A2'] = A2
298 ms ± 7.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So this is about twice as slow as apply, which isn't a terrible performance regression, but if we cythonize the above, we get much better performance. Assuming you are using IPython:
%load_ext cython

%%cython
cpdef power(vals):
    A1, A2 = [], []
    cdef double val
    for val in vals:
        A1.append(val**2)
        A2.append(val**3)
    return A1, A2
%timeit df['A1'], df['A2'] = power(df['a'])
72.7 ms ± 2.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
You can get even greater speed improvements if you use direct vectorized operations:
%timeit df['A1'], df['A2'] = df['a'] ** 2, df['a'] ** 3
5.13 ms ± 320 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This takes advantage of NumPy's extremely fast vectorized operations instead of our loops. We now have a 30x speedup over the original zip/apply version (159 ms).
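As a small illustration of what "vectorized" means here, the sketch below applies the same power operations to the underlying NumPy array and checks them against the explicit loop. The tiny three-row frame and the variable names are assumptions chosen so the result is easy to verify by eye.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [2, 3, 4]})

# The column's values as a NumPy array; ** is applied to every
# element in compiled code rather than in a Python-level loop.
a = df["a"].to_numpy()
df["A1"], df["A2"] = a ** 2, a ** 3

# The explicit loop from earlier produces the same values, just more slowly.
loop_A1 = [val ** 2 for val in df["a"]]
loop_A2 = [val ** 3 for val in df["a"]]

print(df)
```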
apply
The above example should clearly show how slow apply can be, but just so it's extra clear, let's look at the most basic example. Let's square a Series of 10 million numbers with and without apply:
s = pd.Series(np.random.rand(10000000))
def calc(x):
    return x ** 2  # definition assumed from the text: square one value
%timeit s.apply(calc)
3.3 s ± 57.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Without apply, it is 50x faster:
%timeit s ** 2
66 ms ± 2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
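Before trusting the faster form, it is worth confirming that both give the same answer. The sketch below does that on a smaller Series; calc is assumed to square a single value, as the text describes, and the seed and size are arbitrary.

```python
import numpy as np
import pandas as pd

def calc(x):
    # Assumed definition: square one value.
    return x ** 2

rng = np.random.default_rng(0)
s = pd.Series(rng.random(1000))

slow = s.apply(calc)  # one Python function call per element
fast = s ** 2         # a single vectorized operation

# Compare with a tolerance, since these are floating-point values.
same = np.allclose(slow, fast)
print(same)
```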