Efficiently replace values in one column with values from another column in a Pandas DataFrame
I have a Pandas DataFrame like this:
col1 col2 col3
1 0.2 0.3 0.3
2 0.2 0.3 0.3
3 0 0.4 0.4
4 0 0 0.3
5 0 0 0
6 0.1 0.4 0.4
I want to replace the col1 values with the values in the second column (col2) only where the col1 values are equal to 0, and then, for the zero values that remain, do it again with the third column (col3). The desired result is the following:
col1 col2 col3
1 0.2 0.3 0.3
2 0.2 0.3 0.3
3 0.4 0.4 0.4
4 0.3 0 0.3
5 0 0 0
6 0.1 0.4 0.4
I did it using the pd.replace function, but it seems too slow. I think there must be a faster way to accomplish that.
df.col1.replace(0,df.col2,inplace=True)
df.col1.replace(0,df.col3,inplace=True)
Is there a faster way to do that, using some other function instead of the pd.replace function?
Using np.where is faster. Using a similar pattern to the one you used with replace:
df['col1'] = np.where(df['col1'] == 0, df['col2'], df['col1'])
df['col1'] = np.where(df['col1'] == 0, df['col3'], df['col1'])
However, using a nested np.where is slightly faster:
df['col1'] = np.where(df['col1'] == 0,
                      np.where(df['col2'] == 0, df['col3'], df['col2']),
                      df['col1'])
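To make the two variants above easy to compare, here is a minimal self-contained sketch that rebuilds the sample data from the question and checks that the two-step and nested np.where versions agree. The variable names (split, fallback, nested) are only illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data from the question (index 1..6).
df = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    },
    index=range(1, 7),
)

# Two-step version: fill zeros from col2, then any remaining zeros from col3.
split = np.where(df["col1"] == 0, df["col2"], df["col1"])
split = np.where(split == 0, df["col3"], split)

# Nested version: build the fallback (col2, else col3) first, then apply it once.
fallback = np.where(df["col2"] == 0, df["col3"], df["col2"])
nested = np.where(df["col1"] == 0, fallback, df["col1"])

# Both variants should produce the same column.
assert np.array_equal(split, nested)

df["col1"] = nested
print(df)
```

The nested form touches col1 only once, which is probably where the small speed advantage comes from.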
Timings
Using the following setup to produce a larger sample DataFrame and timing functions:
df = pd.concat([df]*10**4, ignore_index=True)
def root_nested(df):
    df['col1'] = np.where(df['col1'] == 0, np.where(df['col2'] == 0, df['col3'], df['col2']), df['col1'])
    return df

def root_split(df):
    df['col1'] = np.where(df['col1'] == 0, df['col2'], df['col1'])
    df['col1'] = np.where(df['col1'] == 0, df['col3'], df['col1'])
    return df

def pir2(df):
    df['col1'] = df.where(df.ne(0), np.nan).bfill(axis=1).col1.fillna(0)
    return df

def pir2_2(df):
    slc = (df.values != 0).argmax(axis=1)
    return df.values[np.arange(slc.shape[0]), slc]

def andrew(df):
    df.col1[df.col1 == 0] = df.col2
    df.col1[df.col1 == 0] = df.col3
    return df

def pablo(df):
    df['col1'] = df['col1'].replace(0,df['col2'])
    df['col1'] = df['col1'].replace(0,df['col3'])
    return df
I get the following timings:我得到以下时间:
%timeit root_nested(df.copy())
100 loops, best of 3: 2.25 ms per loop
%timeit root_split(df.copy())
100 loops, best of 3: 2.62 ms per loop
%timeit pir2(df.copy())
100 loops, best of 3: 6.25 ms per loop
%timeit pir2_2(df.copy())
1 loop, best of 3: 2.4 ms per loop
%timeit andrew(df.copy())
100 loops, best of 3: 8.55 ms per loop
I tried timing your method, but it's been running for multiple minutes without completing. As a comparison, timing your method on just the 6 row example DataFrame (not the much larger one tested above) took 12.8 ms.
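The timings above come from IPython's %timeit. If you want to reproduce the comparison in a plain script, a rough sketch with the standard timeit module could look like the following. The names base, big and timings, and the number/repeat values, are my own choices rather than anything from the original benchmark, and the numbers you get will depend on your machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Six-row sample from the question, repeated 10**4 times as in the setup above.
base = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    }
)
big = pd.concat([base] * 10**4, ignore_index=True)


def root_nested(df):
    df["col1"] = np.where(
        df["col1"] == 0,
        np.where(df["col2"] == 0, df["col3"], df["col2"]),
        df["col1"],
    )
    return df


def root_split(df):
    df["col1"] = np.where(df["col1"] == 0, df["col2"], df["col1"])
    df["col1"] = np.where(df["col1"] == 0, df["col3"], df["col1"])
    return df


# Time each function on a fresh copy so earlier runs do not change the input.
calls = 10
timings = {}
for func in (root_nested, root_split):
    best = min(timeit.repeat(lambda: func(big.copy()), number=calls, repeat=3))
    timings[func.__name__] = best / calls
    print(f"{func.__name__}: {timings[func.__name__] * 1e3:.2f} ms per call")
```

Note that each measurement includes the cost of big.copy(), just as the %timeit calls above do.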
I'm not sure if it's faster, but you're right that you can slice the dataframe to get your desired result.
df.col1[df.col1 == 0] = df.col2
df.col1[df.col1 == 0] = df.col3
print(df)
Output:
col1 col2 col3
0 0.2 0.3 0.3
1 0.2 0.3 0.3
2 0.4 0.4 0.4
3 0.3 0.0 0.3
4 0.0 0.0 0.0
5 0.1 0.4 0.4
Alternatively, if you want it to be more terse (though I don't know if it's faster), you can combine what you did with what I did.
df.col1[df.col1 == 0] = df.col2.replace(0, df.col3)
print(df)
Output:
col1 col2 col3
0 0.2 0.3 0.3
1 0.2 0.3 0.3
2 0.4 0.4 0.4
3 0.3 0.0 0.3
4 0.0 0.0 0.0
5 0.1 0.4 0.4
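One caveat with the slicing shown above: df.col1[...] = ... is chained indexing, which pandas can flag with a SettingWithCopyWarning and which, as far as I understand, no longer updates df once Copy-on-Write is enabled. A hedged sketch of the same terse idea written with a single .loc assignment is below. It swaps in Series.mask for the inner replace, so it is a variation on the snippet above rather than the same code, and the names zero and fallback are just for illustration.

```python
import pandas as pd

# Sample data from the question, with the default 0..5 index used in the output above.
df = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    }
)

# Rows whose col1 needs replacing.
zero = df["col1"] == 0

# Fallback value per row: col2, or col3 where col2 is itself 0.
fallback = df["col2"].mask(df["col2"] == 0, df["col3"])

# Single .loc assignment; the right-hand side is aligned on the index.
df.loc[zero, "col1"] = fallback
print(df)
```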
An approach using pd.DataFrame.where and pd.DataFrame.bfill:
df['col1'] = df.where(df.ne(0), np.nan).bfill(axis=1).col1.fillna(0)
df
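Since the one-liner above packs several steps together, here is the same idea unrolled on the question's sample data. The intermediate names masked and filled are mine. The sketch assumes the frame contains only the three candidate columns, in priority order from left to right.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    }
)

# Step 1: turn every 0 into NaN so it counts as "missing".
masked = df.where(df.ne(0), np.nan)

# Step 2: back-fill along each row, so a missing cell takes the next
# non-missing value to its right.
filled = masked.bfill(axis=1)

# Step 3: rows that were all zeros are still NaN in col1; restore them to 0.
df["col1"] = filled["col1"].fillna(0)
print(df)
```

Because bfill works across the whole row, any extra columns to the right of col3 would also be used as fallbacks, so select the relevant columns first if your frame is wider.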
Another approach using np.argmax:
def pir2(df):
    slc = (df.values != 0).argmax(axis=1)
    return df.values[np.arange(slc.shape[0]), slc]
I know there is a better way to use numpy to slice. I just can't think of it at the moment.
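For what it's worth, here is a small sketch of how the argmax idea plays out on the sample data, including assigning the result back to col1 (the function above only returns the array). The names values and first_nonzero are illustrative, and the sketch assumes the frame holds only the three candidate columns in priority order.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    }
)

values = df.to_numpy()

# For each row, position of the first non-zero entry. argmax on a boolean
# row returns the first True; for an all-False row it returns 0.
first_nonzero = (values != 0).argmax(axis=1)

# Pick one value per row using paired row/column indices.
df["col1"] = values[np.arange(len(df)), first_nonzero]
print(df)
```

The all-zero row ends up pointing at column 0, whose value is already 0, so that row is left unchanged.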
Generally speaking, there are three types of methods for this conditional replacement task. They are:
numpy.where
pandas.Series.mask or pandas.Series.where, which is the opposite of Series.mask
pandas.DataFrame.loc
You can try pandas.Series.mask:
df['col1'] = df['col1'].mask(df['col1'].eq(0), df['col2'])
df['col1'] = df['col1'].mask(df['col1'].eq(0), df['col3'])
col1 col2 col3
1 0.2 0.3 0.3
2 0.2 0.3 0.3
3 0.4 0.4 0.4
4 0.3 0.0 0.3
5 0.0 0.0 0.0
6 0.1 0.4 0.4
Or pandas.Series.where:
df['col1'] = df['col1'].where(df['col1'].ne(0), df['col2'])
df['col1'] = df['col1'].where(df['col1'].ne(0), df['col3'])
Finally, you can try loc:
df.loc[df['col1'].eq(0), 'col1'] = df['col2']
df.loc[df['col1'].eq(0), 'col1'] = df['col3']
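As a quick sanity check, the three variants above can be wrapped in small functions and compared on the question's sample data. The function names (with_mask, with_where, with_loc) and the make_df helper are hypothetical, introduced only for this sketch.

```python
import pandas as pd


def make_df():
    # Sample data from the question (index 1..6).
    return pd.DataFrame(
        {
            "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
            "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
            "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
        },
        index=range(1, 7),
    )


def with_mask(df):
    # mask replaces values where the condition is True.
    for other in ("col2", "col3"):
        df["col1"] = df["col1"].mask(df["col1"].eq(0), df[other])
    return df


def with_where(df):
    # where keeps values where the condition is True and replaces the rest.
    for other in ("col2", "col3"):
        df["col1"] = df["col1"].where(df["col1"].ne(0), df[other])
    return df


def with_loc(df):
    # loc assigns only into the selected rows; the right side aligns on the index.
    for other in ("col2", "col3"):
        df.loc[df["col1"].eq(0), "col1"] = df[other]
    return df


results = [f(make_df()) for f in (with_mask, with_where, with_loc)]
print(results[0])
```

All three should give the same frame, so the choice between them is mostly about readability.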