Pandas: change the values in one column based on the values in another column

Question

I have a dataframe in which one column holds some data and the other column holds the indices I want to delete from that data. So starting from this:

import pandas as pd
import numpy as np

df = pd.DataFrame({'data':[np.arange(1,5),np.arange(3)],'to_delete': [np.array([2]),np.array([0,2])]})
df
>>>> data       to_delete
     [1,2,3,4]    [2]
     [0,1,2]     [0,2]

This is what I want to end up with:

new_df
>>>>   data     to_delete
     [1,2,4]       [2]
       [1]        [0,2]

I could iterate over the rows by hand and calculate the new data for each one like this:

new_data = []
for _,v in df.iterrows():
    foo = np.delete(v['data'],v['to_delete'])
    new_data.append(foo)
df.assign(data=new_data)

but I'm looking for a better way to do this.

Answer 1

The overhead of calling a numpy function for each row really hurts performance here. I'd suggest going with lists instead:

df['data'] = [[j for ix, j in enumerate(i[0]) if ix not in i[1]] 
              for i in df.values]

print(df)

       data to_delete
0  [1, 2, 4]       [2]
1        [1]    [0, 2]
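The same idea can be written with a small helper that walks the two columns in parallel. This is only a sketch: the name drop_positions is hypothetical, and converting the indices to a set is an added assumption meant to keep the membership test cheap. Unlike np.delete, it does not interpret negative indices.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "data": [np.arange(1, 5), np.arange(3)],
    "to_delete": [np.array([2]), np.array([0, 2])],
})


def drop_positions(values, positions):
    """Return values as a list, skipping the given (non-negative) positions."""
    skip = set(np.asarray(positions).tolist())
    return [v for ix, v in enumerate(values) if ix not in skip]


# Pair each data array with its own index array, row by row.
new_data = [drop_positions(d, t) for d, t in zip(df["data"], df["to_delete"])]

# assign returns a new dataframe and leaves df untouched.
new_df = df.assign(data=new_data)
print(new_df)
```

Iterating over the two columns with zip avoids building a row object for every record, which is where iterrows and apply spend much of their time.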

Timings on a 20K row dataframe:

df_large = pd.concat([df]*10000, axis=0)

%timeit [[j for ix, j in enumerate(i[0]) if ix not in i[1]] 
         for i in df_large.values]
# 184 ms ± 12.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit 
new_data = []
for _,v in df_large.iterrows():
    foo = np.delete(v['data'],v['to_delete'])
    new_data.append(foo)

# 5.44 s ± 233 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df_large.apply(lambda row: np.delete(row["data"], 
                       row["to_delete"]), axis=1)
# 5.29 s ± 340 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
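%timeit is an IPython magic, so the measurements above need a notebook or IPython shell. A rough equivalent for a plain script, using the standard timeit module, might look like the sketch below. The function names and the smaller 2K row frame are assumptions chosen to keep the run short, and the numbers it prints will differ from those quoted above.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "data": [np.arange(1, 5), np.arange(3)],
    "to_delete": [np.array([2]), np.array([0, 2])],
})
# Smaller than the 20K rows used above, to keep the sketch quick.
df_large = pd.concat([df] * 1000, ignore_index=True)


def with_lists(frame):
    """List-comprehension approach, iterating the two columns in parallel."""
    return [[v for ix, v in enumerate(d) if ix not in t]
            for d, t in zip(frame["data"], frame["to_delete"])]


def with_iterrows(frame):
    """Row-by-row np.delete, as in the question."""
    return [np.delete(row["data"], row["to_delete"])
            for _, row in frame.iterrows()]


for fn in (with_lists, with_iterrows):
    seconds = timeit.timeit(lambda: fn(df_large), number=3) / 3
    print(f"{fn.__name__}: {seconds * 1000:.1f} ms per run")
```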

Answer 2

You should use the apply function to apply a function to every row in the dataframe:

df["data"] = df.apply(lambda row: np.delete(row["data"], row["to_delete"]), axis=1)
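For readers who prefer a named function over a lambda, the same call can be sketched as follows. The helper name delete_row_indices is hypothetical. With axis=1, apply passes each row to the function as a Series, so the columns are read by label.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "data": [np.arange(1, 5), np.arange(3)],
    "to_delete": [np.array([2]), np.array([0, 2])],
})


def delete_row_indices(row):
    """Remove the positions listed in row['to_delete'] from row['data']."""
    return np.delete(row["data"], row["to_delete"])


# axis=1 calls the function once per row; the result is a Series of arrays.
trimmed = df.apply(delete_row_indices, axis=1)

# assign keeps the original dataframe unchanged.
new_df = df.assign(data=trimmed)
print(new_df)
```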

Answer 3

Another solution, based on starmap:

This solution relies on starmap, a lesser-known tool from the itertools module.

Check its documentation; it's worth a try!

import pandas as pd
import numpy as np
from itertools import starmap

df = pd.DataFrame({'data': [np.arange(1,5),np.arange(3)],
                   'to_delete': [np.array([2]),np.array([0,2])]})

# Solution: 
df2 = df.copy()
A = list(starmap(lambda v,l: np.delete(v,l),
                             zip(df['data'],df['to_delete'])))

df2['data'] = pd.DataFrame(zip(A))
df2

prints out:

        data to_delete
0  [1, 2, 4]       [2]
1        [1]    [0, 2]
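A slightly shorter variant of the starmap approach is sketched below, on the assumption that the wrapping lambda and the intermediate DataFrame are not needed: np.delete already takes the array and the indices as its first two positional arguments, so starmap can call it directly on each pair.

```python
from itertools import starmap

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "data": [np.arange(1, 5), np.arange(3)],
    "to_delete": [np.array([2]), np.array([0, 2])],
})

# starmap unpacks each (data, to_delete) pair into np.delete(data, to_delete).
trimmed = list(starmap(np.delete, zip(df["data"], df["to_delete"])))

# Build the result without modifying df.
df2 = df.assign(data=trimmed)
print(df2)
```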

3 answers

Answer 1: score 2, accepted, 2020-04-07 21:20:27
Answer 2: score 1, 2020-04-07 21:15:50
Answer 3: score 0, 2020-04-07 21:43:39
