Faster way of setting rows of DataFrame to NaN based on last index
I've got data frames that look like this:
import time
import pandas as pd
import numpy as np

N = 3
l = []
for i in range(N):
    n = np.random.choice(5) + 2
    l += [pd.DataFrame(dict(ID=np.repeat(i, n),
                            t=list(range(n)),
                            X=np.random.normal(size=n)))]
df = pd.concat(l)
df
Out[85]:
ID t X
0 0 0 0.992300
1 0 1 0.226487
2 0 2 -0.731178
3 0 3 0.748376
4 0 4 1.269106
0 1 0 0.512957
1 1 1 -1.274963
2 1 2 0.186314
3 1 3 1.243093
0 2 0 0.321971
1 2 1 0.233895
2 2 2 0.293439
I need to set the last value of t for each ID to NaN. Right now, I can do it one of two ways:
trimlast = df.groupby('ID').apply(lambda x: x.head(-1)).reset_index(drop=True)
df = df.drop(columns='X').merge(trimlast, how='left')
Or:
def f(d):
    d.loc[d.t == d.t.max(), 'X'] = np.nan
    return d
df = df.groupby('ID').apply(f).reset_index(drop=True)
Both of which yield:
df
Out[87]:
ID t X
0 0 0 0.992300
1 0 1 0.226487
2 0 2 -0.731178
3 0 3 0.748376
4 0 4 NaN
5 1 0 0.512957
6 1 1 -1.274963
7 1 2 0.186314
8 1 3 NaN
9 2 0 0.321971
10 2 1 0.233895
11 2 2 NaN
They're too slow when the data gets big. The time is approximately linear in the data size:
def sizetry(N, other_way=False):
    np.random.seed(0)
    l = []
    for i in range(N):
        n = np.random.choice(5) + 2
        l += [pd.DataFrame(dict(ID=np.repeat(i, n),
                                t=list(range(n)),
                                X=np.random.normal(size=n)))]
    df = pd.concat(l)
    start = time.time()
    if other_way:
        trimlast = df.groupby('ID').apply(lambda x: x.head(-1)).reset_index(drop=True)
        df = df.drop(columns='X').merge(trimlast, how='left')
    else:
        df = df.groupby('ID').apply(f).reset_index(drop=True)
    end = time.time()
    return end - start
tvec = [sizetry(2**i) for i in range(15)]
tvec_other = [sizetry(2**i, other_way = True) for i in range(15)]
import matplotlib.pyplot as plt
plt.plot(np.log2(tvec), label = "merge way")
plt.plot(np.log2(tvec_other), label = 'other way')
plt.legend()
plt.show()
I suspect that the problem is groupby. Is there a faster way to do this?
First, reset your index:
df = df.reset_index(drop=True)
Then use duplicated() with an inverted boolean mask:
import numpy as np
df.loc[~df.duplicated(subset=['ID'],keep='last'),'X'] = np.nan
print(df)
ID t X
0 0 0 0.424902
1 0 1 1.597951
2 0 2 1.453884
3 0 3 NaN
4 1 0 0.534653
5 1 1 -0.318361
6 1 2 0.188290
7 1 3 1.157802
8 1 4 NaN
9 2 0 0.186005
10 2 1 0.036017
11 2 2 1.039822
12 2 3 -1.602205
13 2 4 -0.210601
14 2 5 NaN
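One caveat worth keeping in mind: duplicated(keep='last') flags the last *row* of each ID, which coincides with the maximum t only when each group is already ordered by t (as in the question's data). A minimal self-contained sketch (with made-up values) showing that the inverted mask selects exactly one row per ID:

```python
import numpy as np
import pandas as pd

# Illustrative frame in the same shape as the question (values are made up).
df = pd.DataFrame({'ID': [0, 0, 0, 1, 1, 2, 2, 2],
                   't':  [0, 1, 2, 0, 1, 0, 1, 2],
                   'X':  np.arange(8, dtype=float)}).reset_index(drop=True)

# duplicated(keep='last') is True for every row except the last one per ID,
# so the inverse selects exactly the final row of each group.
mask = ~df.duplicated(subset=['ID'], keep='last')
df.loc[mask, 'X'] = np.nan

print(int(df['X'].isna().sum()))  # 3 -- one NaN per ID
```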
If you want the row with the maximum t value to be changed, use idxmax() with a groupby:
df.loc[df.groupby('ID')['t'].idxmax(), 'X'] = np.nan
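As a quick check (sample data made up here): groupby('ID')['t'].idxmax() returns one index label per group, the row where t is largest, and .loc can use those labels directly as long as the index is unique (e.g. after reset_index):

```python
import numpy as np
import pandas as pd

# Made-up sample in the question's shape; the index must be unique
# for .loc with idxmax() labels to target the right rows.
df = pd.DataFrame({'ID': [0, 0, 1, 1, 1],
                   't':  [0, 1, 0, 1, 2],
                   'X':  [0.1, 0.2, 0.3, 0.4, 0.5]})

# One index label per group: the row where t is at its maximum.
last_rows = df.groupby('ID')['t'].idxmax()
df.loc[last_rows, 'X'] = np.nan

print(int(df['X'].isna().sum()))  # 2 -- one NaN per ID
```

Unlike the duplicated() trick, this works even if the groups are not sorted by t.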