How to apply a formula to all columns in a pandas DataFrame

Question

I have the following DataFrame:

import pandas as pd
data = {'MA1': [ float("nan"),  float("nan"),      -1,   1],
        'MA2': [ float("nan"),            -1,       0,   0],
        'MA3': [            0,             0,       1,  -1]}
df_input = pd.DataFrame(data, columns=['MA1', 'MA2', 'MA3'])

My goal is, for every column, to set the first non-NaN, non-zero value to 0 if that value is -1.

Clarification:

The value should only be set to 0 if the first non-zero, non-NaN value is -1. If it is 1 or anything else, leave the column unchanged.

What is the fastest way to do it?

Answer 1

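You can loop over the columns and use DataFrame.loc to assign 0 where the first non-zero value is -1. Note that idxmin returns the label of the column's minimum (its first occurrence), so on data limited to -1, 0 and 1 this zeroes the first -1 in every column, including one that comes after a 1 (as in MA3 below):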

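import numpy as np

# Treat zeros as missing so that idxmin skips them
dft = df_input.replace(0, np.nan)

for col in df_input.columns:
    idxmin = dft[col].idxmin()  # label of the column's minimum
    if df_input.loc[idxmin, col] == -1:
        df_input.loc[idxmin, col] = 0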

   MA1  MA2  MA3
0  NaN  NaN    0
1  NaN  0.0    0
2  0.0  0.0    1
3  1.0  0.0    0

Or, more efficiently, use DataFrame.idxmin so we don't have to call Series.idxmin on each iteration of the loop:

dft = df_input.replace(0, np.nan).idxmin()

# Series.iteritems was removed in pandas 2.0; items() is the replacement
for col, idx in dft.items():
    if df_input.loc[idx, col] == -1:
        df_input.loc[idx, col] = 0

   MA1  MA2  MA3
0  NaN  NaN    0
1  NaN  0.0    0
2  0.0  0.0    1
3  1.0  0.0    0
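For the clarified requirement (leave a column alone when its first non-zero, non-NaN value is not -1), a variant built on Series.first_valid_index may be closer to what was asked. This is only a sketch and not part of the original answer; the helper name zero_leading_minus_one is made up for illustration.

```python
import numpy as np
import pandas as pd


def zero_leading_minus_one(df):
    """Return a copy in which each column's first non-zero, non-NaN
    value is replaced by 0, but only if that value is -1.
    (Hypothetical helper, not from the original answer.)"""
    out = df.copy()
    # Treat zeros as missing so first_valid_index skips them too
    masked = out.replace(0, np.nan)
    for col in out.columns:
        idx = masked[col].first_valid_index()  # None if nothing qualifies
        if idx is not None and out.loc[idx, col] == -1:
            out.loc[idx, col] = 0
    return out


data = {'MA1': [float("nan"), float("nan"), -1, 1],
        'MA2': [float("nan"), -1, 0, 0],
        'MA3': [0, 0, 1, -1]}
df_input = pd.DataFrame(data, columns=['MA1', 'MA2', 'MA3'])
df_output = zero_leading_minus_one(df_input)
print(df_output)
```

With this version MA3 keeps its trailing -1, because the first non-zero value in that column is 1.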

Answer 2

After about a year of using Python, I'm trying to get better at writing higher-performing solutions, so I tested the performance of my answer against the others'. I expected mine to be the slowest; on the DataFrame I created, it ended up being 50,000x slower than the best answer. Also, here is a good article about pandas and performance: https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6

My traditional slow looping method stepped through each column almost 100,000 times (the length of the DataFrame), while the best answer visited each column only once, because idxmin() identified the relevant row and made it unnecessary to loop through all of them.

Here is a DataFrame with 100,000 rows and 4 columns that I used to test my answer against those of @Erfan and @DerekO:

df_input = pd.DataFrame(np.random.randint(0, 10, size=(100000,4)).astype(float), columns=list('ABCD'))
df_input.iloc[99998:, 0:4] = -1

My answer (slowest): 2.78 s ± 269 ms per loop

for col in df_input.columns:
    for row in range(len(df_input.index)):
        if df_input.loc[row, col] == -1:
            df_input.loc[row, col] = 0
            break    
df_input

Derek O's answer #1: 283 ms ± 13.2 ms per loop, 10x faster than my answer!

Erfan's answer #1: 2.73 ms ± 135 µs per loop, 1,000x faster than my answer!

Erfan's answer #2: 54.8 µs ± 5.65 µs per loop, 50,000x faster than my answer!
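The figures above look like IPython %timeit output. A comparison of this kind could be reproduced in plain Python with the standard timeit module, roughly as sketched below. The function names, the smaller 2,000-row frame and the fixed seed are assumptions made to keep the run short, so the absolute numbers will differ from those quoted above.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(n_rows=2_000, seed=0):
    # Smaller than the 100,000-row frame above so the slow loop finishes quickly
    rng = np.random.default_rng(seed)
    values = rng.integers(0, 10, size=(n_rows, 4)).astype(float)
    df = pd.DataFrame(values, columns=list("ABCD"))
    df.iloc[n_rows - 2:, :] = -1  # the only -1 values sit in the last two rows
    return df


def slow_loop(df):
    # Row-by-row scan: stops at the first -1 in each column
    df = df.copy()
    for col in df.columns:
        for row in range(len(df.index)):
            if df.loc[row, col] == -1:
                df.loc[row, col] = 0
                break
    return df


def idxmin_version(df):
    # One idxmin call locates the candidate row for every column
    df = df.copy()
    for col, idx in df.replace(0, np.nan).idxmin().items():
        if df.loc[idx, col] == -1:
            df.loc[idx, col] = 0
    return df


frame = make_frame()
same_result = slow_loop(frame).equals(idxmin_version(frame))

for fn in (slow_loop, idxmin_version):
    # Each timing includes the cost of the defensive copy
    seconds = timeit.timeit(lambda: fn(frame), number=3) / 3
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```

Checking that both functions return the same frame before timing them guards against comparing the speed of two different computations.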

Answer 3

Apply a custom function to each column. The function loops through the column's values to find the first non-NaN, non-zero value, sets it to 0 if it is -1, and then returns the column.

import numpy as np
import pandas as pd

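def set_column(col_values):
    col_values = col_values.copy()  # avoid mutating the caller's column
    # Scan for the first value that is neither zero nor NaN
    for index, value in enumerate(col_values):
        if value != 0 and not np.isnan(value):
            if value == -1:
                col_values.iloc[index] = 0  # positional assignment
            break
    # Always return the column, even if it holds only zeros and NaN
    return col_values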

data = {'MA1': [ float("nan"),  float("nan"),      -1,   1],
        'MA2': [ float("nan"),            -1,       0,   0],
        'MA3': [            0,             0,       1,   0]}

df_input = pd.DataFrame(data, columns=['MA1', 'MA2', 'MA3'])
df_output = df_input.copy().apply(set_column, axis=0)

Output:

>>> df_output
   MA1  MA2  MA3
0  NaN  NaN    0
1  NaN  0.0    0
2  0.0  0.0    1
3  1.0  0.0    0
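The same per-column rule can also be written without a Python-level loop over values. The following is a sketch, not taken from any of the answers: it builds a boolean mask marking each column's first non-zero, non-NaN entry and uses DataFrame.mask to zero that entry where it equals -1. It uses the question's original data, in which MA3 ends with -1.

```python
import pandas as pd

data = {'MA1': [float("nan"), float("nan"), -1, 1],
        'MA2': [float("nan"), -1, 0, 0],
        'MA3': [0, 0, 1, -1]}
df_input = pd.DataFrame(data, columns=['MA1', 'MA2', 'MA3'])

# True where a value is neither NaN nor zero
valid = df_input.notna() & (df_input != 0)

# The running count of valid entries equals 1 exactly at the first one
first_valid = valid & (valid.cumsum() == 1)

# Zero the first valid entry only where it is -1; everything else is kept
df_output = df_input.mask(first_valid & (df_input == -1), 0)
print(df_output)
```

Because the mask is computed for the whole frame at once, the work is done in vectorized operations rather than in a per-value loop, and df_input itself is not modified.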

Answer 4

I used a modification of @Erfan's answer.

As I explained in the clarification I added to the question, I only want to set the value to zero if the first non-zero, non-NaN value is -1. If it is anything else, nothing should change in that column.

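import numpy as np

df_min = df_input.replace(0, np.nan).idxmin()
df_max = df_input.replace(0, np.nan).idxmax()
for col, idx in df_min.items():
    # <= also covers columns whose only non-zero value is -1,
    # where idxmin and idxmax point at the same row
    if df_input.loc[idx, col] == -1 and idx <= df_max[col]:
        df_input.loc[idx, col] = 0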
