How to perform an operation on two columns of the same dataframe in Python Pandas?

Question

I'm trying to apply the operation 'x-y/y', where x is the column 'Faturamento' and y is the column 'Custo' of the dataframe called 'df', and store the results in a new column called 'Roi'.

My attempt using the apply function:

df['Roi'] = df.apply(lambda x, y: x['Faturamento']-y['Custo']/y['Custo'], axis=1)

It returns:

TypeError: <lambda>() missing 1 required positional argument: 'y'

How can I do this?

Answer 1

You can use plain column arithmetic, written like ordinary math. Pandas automatically aligns the two columns on the index, so each operation is applied row by row.

df['Roi'] = (df['Faturamento'] - df['Custo']) / df['Custo']

or

df['Roi'] = df['Faturamento'] / df['Custo'] - 1

This way you get the fast vectorized processing that Pandas is optimized for. If you use .apply() with a lambda function on axis=1, the underlying processing is just a Python-level loop over the rows, and it will be slow.
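As a minimal sketch of the point about vectorized arithmetic and index alignment, the snippet below rebuilds the four-row sample used in the benchmark and computes the column both ways. The names roi_a, roi_b, roi_c and custo_reversed are illustrative and not part of the original answer.

```python
import pandas as pd

# Four-row sample from the benchmark; column names follow the question.
df = pd.DataFrame({"Faturamento": [50, 10, 5, 100], "Custo": [20, 5, 15, 400]})

# Both vectorized forms compute (revenue - cost) / cost for every row at once.
roi_a = (df["Faturamento"] - df["Custo"]) / df["Custo"]
roi_b = df["Faturamento"] / df["Custo"] - 1

# Alignment is by index label, not by position: reversing the row order of
# one operand does not change which rows get paired with each other.
custo_reversed = df["Custo"].iloc[::-1]
roi_c = (df["Faturamento"] - custo_reversed) / custo_reversed

df["Roi"] = roi_a
print(df)
```

Because the operands are matched on their index labels, roi_c holds the same value for each label as roi_a even though one operand was supplied in reverse order.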

Performance Benchmark

Test 1. Small df with 4 rows

   Faturamento  Custo
0           50     20
1           10      5
2            5     15
3          100    400

%%timeit
df['Roi'] = df.apply(lambda x: (x['Faturamento']-x['Custo'])/x['Custo'], axis=1)

721 µs ± 3.54 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
df['Roi'] = df['Faturamento'] / df['Custo'] - 1

490 µs ± 4.83 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Summary: .apply + lambda takes 721 µs while the Pandas built-in takes 490 µs: 1.47 times faster on a small dataset.

Test 2. Large df with 40000 rows

df2 = pd.concat([df] * 10000, ignore_index=True)

%%timeit
df2['Roi'] = df2.apply(lambda x: (x['Faturamento']-x['Custo'])/x['Custo'], axis=1)

639 ms ± 3.62 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
df2['Roi'] = df2['Faturamento'] / df2['Custo'] - 1

767 µs ± 12.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Summary: .apply + lambda takes 639 ms (= 639,000 µs) while the Pandas built-in takes 767 µs: 833 times faster on a large dataset.
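The %%timeit cells above need IPython or Jupyter. A rough equivalent for a plain Python script, using the standard timeit module, might look like the sketch below. The helper names roi_apply and roi_vectorized are made up for illustration, and the absolute timings will differ by machine and Pandas version.

```python
import timeit

import pandas as pd

df = pd.DataFrame({"Faturamento": [50, 10, 5, 100], "Custo": [20, 5, 15, 400]})
df2 = pd.concat([df] * 10000, ignore_index=True)  # 40,000 rows, as in Test 2


def roi_apply(frame):
    # Row-by-row: the lambda is called once per row in a Python-level loop.
    return frame.apply(
        lambda row: (row["Faturamento"] - row["Custo"]) / row["Custo"], axis=1
    )


def roi_vectorized(frame):
    # Whole-column arithmetic, evaluated in Pandas' optimized code paths.
    return frame["Faturamento"] / frame["Custo"] - 1


# Single run for the slow version, average of 100 runs for the fast one.
t_apply = timeit.timeit(lambda: roi_apply(df2), number=1)
t_vec = timeit.timeit(lambda: roi_vectorized(df2), number=100) / 100

print(f"apply: {t_apply * 1e3:.1f} ms, vectorized: {t_vec * 1e3:.3f} ms")
```

Both helpers return the same values, so only the running time differs.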

Answer 2

I think you mean:

df['Roi'] = df.apply(lambda x: (x['Faturamento']-x['Custo'])/x['Custo'], axis=1)

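With axis=1, x refers to one row of the dataframe at a time (passed as a Series indexed by column name), so the lambda needs only a single argument.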
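To illustrate what .apply() hands to the function, here is a small sketch on a two-row frame with made-up values. The function name roi and the lists seen_types and err are hypothetical helpers used only for this demonstration.

```python
import pandas as pd

# Two illustrative rows; column names follow the question.
df = pd.DataFrame({"Faturamento": [50, 10], "Custo": [20, 5]})

seen_types = []


def roi(row):
    # With axis=1, `row` is a pandas Series holding one row of the frame.
    seen_types.append(type(row).__name__)
    return (row["Faturamento"] - row["Custo"]) / row["Custo"]


df["Roi"] = df.apply(roi, axis=1)
print(seen_types)
print(df["Roi"].tolist())

# A two-argument lambda fails because apply passes exactly one positional
# argument (the row), leaving `y` unfilled.
err = None
try:
    df.apply(lambda x, y: x["Faturamento"] - y["Custo"] / y["Custo"], axis=1)
except TypeError as exc:
    err = str(exc)
print(err)
```

This is why the attempt in the question raised a TypeError: both columns are already available on the single row argument.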

Answer 1: 3 votes, accepted, 2021-09-28 15:47:55
Answer 2: 1 vote, 2021-09-28 15:47:39
