Faster way of setting rows of DataFrame to NaN based on last index
I've got data frames that look like this:
import time
import pandas as pd
import numpy as np

N = 3
l = []
for i in range(N):
    n = np.random.choice(5) + 2
    l += [pd.DataFrame(dict(ID=np.repeat(i, n),
                            t=list(range(n)),
                            X=np.random.normal(size=n)))]
df = pd.concat(l)
df
Out[85]:
ID t X
0 0 0 0.992300
1 0 1 0.226487
2 0 2 -0.731178
3 0 3 0.748376
4 0 4 1.269106
0 1 0 0.512957
1 1 1 -1.274963
2 1 2 0.186314
3 1 3 1.243093
0 2 0 0.321971
1 2 1 0.233895
2 2 2 0.293439
I need to set the last value of t for each ID to NaN. Right now, I can do it one of two ways:
trimlast = df.groupby('ID').apply(lambda x: x.head(-1)).reset_index(drop=True)
df = df.drop(columns='X').merge(trimlast, how='left')
Or:
def f(d):
    d.loc[d.t == d.t.max(), 'X'] = np.nan
    return d
df = df.groupby('ID').apply(f).reset_index(drop=True)
Both of which yield:
df
Out[87]:
ID t X
0 0 0 0.992300
1 0 1 0.226487
2 0 2 -0.731178
3 0 3 0.748376
4 0 4 NaN
5 1 0 0.512957
6 1 1 -1.274963
7 1 2 0.186314
8 1 3 NaN
9 2 0 0.321971
10 2 1 0.233895
11 2 2 NaN
They're too slow when the data gets big. The time is approximately linear in the data size:
def sizetry(N, other_way=False):
    np.random.seed(0)
    l = []
    for i in range(N):
        n = np.random.choice(5) + 2
        l += [pd.DataFrame(dict(ID=np.repeat(i, n),
                                t=list(range(n)),
                                X=np.random.normal(size=n)))]
    df = pd.concat(l)
    start = time.time()
    if other_way:
        trimlast = df.groupby('ID').apply(lambda x: x.head(-1)).reset_index(drop=True)
        df = df.drop(columns='X').merge(trimlast, how='left')
    else:
        df = df.groupby('ID').apply(f).reset_index(drop=True)
    end = time.time()
    return end - start
tvec = [sizetry(2**i) for i in range(15)]
tvec_other = [sizetry(2**i, other_way = True) for i in range(15)]
import matplotlib.pyplot as plt
plt.plot(np.log2(tvec), label = "merge way")
plt.plot(np.log2(tvec_other), label = 'other way')
plt.legend()
plt.show()
I suspect that the problem is groupby. Is there a faster way to do this?
First, reset your index:
df = df.reset_index(drop=True)
Then use duplicated() with an inverted boolean mask:
import numpy as np
df.loc[~df.duplicated(subset=['ID'],keep='last'),'X'] = np.nan
print(df)
ID t X
0 0 0 0.424902
1 0 1 1.597951
2 0 2 1.453884
3 0 3 NaN
4 1 0 0.534653
5 1 1 -0.318361
6 1 2 0.188290
7 1 3 1.157802
8 1 4 NaN
9 2 0 0.186005
10 2 1 0.036017
11 2 2 1.039822
12 2 3 -1.602205
13 2 4 -0.210601
14 2 5 NaN
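One caveat worth keeping in mind: duplicated(keep='last') flags the last *row* of each ID, which coincides with the maximum t only when each group is already ordered by t (as in the question's data). A minimal self-contained sketch (with made-up values) showing that the inverted mask selects exactly one row per ID:

```python
import numpy as np
import pandas as pd

# Illustrative frame in the same shape as the question (values are made up).
df = pd.DataFrame({'ID': [0, 0, 0, 1, 1, 2, 2, 2],
                   't':  [0, 1, 2, 0, 1, 0, 1, 2],
                   'X':  np.arange(8, dtype=float)}).reset_index(drop=True)

# duplicated(keep='last') is True for every row except the last one per ID,
# so the inverse selects exactly the final row of each group.
mask = ~df.duplicated(subset=['ID'], keep='last')
df.loc[mask, 'X'] = np.nan

print(int(df['X'].isna().sum()))  # 3 -- one NaN per ID
```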
If you want the row with the maximum t value to be changed, use idxmax() with a groupby:
df.loc[df.groupby('ID')['t'].idxmax(), 'X'] = np.nan
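As a quick check (sample data made up here): groupby('ID')['t'].idxmax() returns one index label per group, the row where t is largest, and .loc can use those labels directly as long as the index is unique (e.g. after reset_index):

```python
import numpy as np
import pandas as pd

# Made-up sample in the question's shape; the index must be unique
# for .loc with idxmax() labels to target the right rows.
df = pd.DataFrame({'ID': [0, 0, 1, 1, 1],
                   't':  [0, 1, 0, 1, 2],
                   'X':  [0.1, 0.2, 0.3, 0.4, 0.5]})

# One index label per group: the row where t is at its maximum.
last_rows = df.groupby('ID')['t'].idxmax()
df.loc[last_rows, 'X'] = np.nan

print(int(df['X'].isna().sum()))  # 2 -- one NaN per ID
```

Unlike the duplicated() trick, this works even if the groups are not sorted by t.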