Get top n columns and ignore NaN
I am struggling to get the top n columns (n = 3 in my case) for each row while ignoring NaN. My dataset:
import numpy as np
import pandas as pd
x = {'ID': ['1', '2', '3', '4', '5'],
     'productA': [0.47, 0.65, 0.48, 0.58, 0.67],
     'productB': [0.65, 0.47, 0.55, np.nan, np.nan],
     'productC': [0.78, np.nan, np.nan, np.nan, np.nan],
     'productD': [np.nan, np.nan, 0.25, np.nan, np.nan],
     'productE': [0.12, np.nan, 0.47, 0.12, np.nan]}  # np.nan: the np.NaN alias was removed in NumPy 2.0
df = pd.DataFrame(x)
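The result I want:
ID | Top 3 |
---|---|
A1 | productC - productB - productA |
A2 | productA - productB |
A3 | productB - productA - productE |
A4 | productA - productE |
A5 | productA |
As you can see, if a row has fewer than 3 non-NaN values, it should keep however many there are, still sorted by value. I tried np.argsort, but it does not ignore NaN; instead it orders the missing products alphabetically.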
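To make the np.argsort problem concrete, here is a small sketch using the values of the first row (ID 1). The variable names are only illustrative. NumPy sorts NaN to the end, so taking the last three positions of the sorted order picks up the NaN column.

```python
import numpy as np

# Values for ID 1, in column order productA..productE (productD is missing).
row = np.array([0.47, 0.65, 0.78, np.nan, 0.12])

# argsort returns positions from smallest to largest; NaN is placed last.
order = np.argsort(row)
print(order)

# Naively taking the "largest three" therefore includes the NaN position (3).
naive_top3 = order[-3:]
print(naive_top3, row[naive_top3])
```

This is why the NaN positions have to be masked out (or the values replaced) before slicing.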
Try:
df.set_index("ID").apply(
    lambda x: pd.Series(x.nlargest(3).index).tolist(), axis=1
)
ID
1 [productC, productB, productA]
2 [productA, productB]
3 [productB, productA, productE]
4 [productA, productE]
5 [productA]
dtype: object
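You can use np.argsort together with np.isnan to filter out the NaNs; after that, plain boolean indexing does the job. Note that each list in the output below is ordered from smallest to largest.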
arr = df.iloc[:, 1:].to_numpy() # Leaving out `ID` col
idx = arr.argsort(axis=1)
m = np.isnan(arr)
m = m[np.arange(arr.shape[0])[:,None], idx]
out = df.columns[1:].to_numpy()[idx]
out = [v[~c][-3:] for v, c in zip(out, m)]
pd.Series(out, index= df['ID'])
ID
1 [productA, productB, productC]
2 [productB, productA]
3 [productE, productA, productB]
4 [productE, productA]
5 [productA]
dtype: object
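The lists above run from smallest to largest, whereas the question asks for largest first. One way to get that order, sketched here under the assumption that negating the array is acceptable, is to argsort `-arr`: NaN stays NaN under negation and is still sorted last, so the first few positions of each row are the largest valid values. The names `n_valid` and `top3_desc` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": ["1", "2", "3", "4", "5"],
    "productA": [0.47, 0.65, 0.48, 0.58, 0.67],
    "productB": [0.65, 0.47, 0.55, np.nan, np.nan],
    "productC": [0.78, np.nan, np.nan, np.nan, np.nan],
    "productD": [np.nan, np.nan, 0.25, np.nan, np.nan],
    "productE": [0.12, np.nan, 0.47, 0.12, np.nan],
})

arr = df.iloc[:, 1:].to_numpy()            # leave out the ID column
cols = df.columns[1:].to_numpy()

idx = np.argsort(-arr, axis=1)             # descending by value, NaN last
n_valid = (~np.isnan(arr)).sum(axis=1)     # number of non-NaN cells per row

# Keep at most 3 names per row, and never more than the row has valid values.
top3_desc = [cols[i[:min(3, k)]].tolist() for i, k in zip(idx, n_valid)]
print(pd.Series(top3_desc, index=df["ID"]))
```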
df.apply over axis=1 is just a for-loop under the hood and can be slow. You can instead leverage NumPy's vectorized functions to gain some efficiency.
In [152]: %%timeit
...: df.set_index('ID').apply(lambda x: pd.Series(x.nlargest(3).index).tolist(), axis=1)
...:
...:
2.04 ms ± 19.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [153]: %%timeit
...: arr = df.iloc[:, 1:].to_numpy() # Leaving out `ID` col
...: idx = arr.argsort(axis=1)
...: m = np.isnan(arr)
...: m = m[np.arange(arr.shape[0])[:,None], idx]
...: out = df.columns[1:].to_numpy()[idx]
...: out = [v[~c][-3:] for v, c in zip(out, m)]
...:
...:
144 µs ± 1.59 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
That is roughly a 14x speedup (2.04 ms vs. 144 µs) on this 5-row example.
I suggest using numpy directly.
Depending on your experience, you may find it a bit confusing and messy (I certainly do).
import numpy as np
# your data
d = {
    'productA': [0.47, 0.65, 0.48, 0.58, 0.67],
    'productB': [0.65, 0.47, 0.55, np.nan, np.nan],
    'productC': [0.78, np.nan, np.nan, np.nan, np.nan],
    'productD': [np.nan, np.nan, 0.25, np.nan, np.nan],
    'productE': [0.12, np.nan, 0.47, 0.12, np.nan]
}
# replace the NaNs with -inf; otherwise argsort treats them as the highest values
for k, v in d.items():
    d[k] = [-np.inf if np.isnan(i) else i for i in v]
# store as a matrix (rows = products, columns = IDs)
matrix = np.array(list(d.values()))
# your ids are 1 to 5
for i in range(1, 6):
    print(f"ID: {i}")
    # argsort with axis=0 ranks the products within each ID column
    # select the (i-1)th column with [:, i - 1]
    # and reverse with [::-1] for descending order
    # (prints product indices, 0 = productA; the -inf entries come last)
    print(np.argsort(matrix, axis=0)[:, i - 1][::-1])
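The loop above prints positional indices for all five products, including the missing ones. A possible extension, sketched here and not part of the original answer, maps the indices back to product names, drops the -inf placeholders and keeps at most three per ID. The names `filled`, `order` and `top` are illustrative.

```python
import numpy as np

d = {
    "productA": [0.47, 0.65, 0.48, 0.58, 0.67],
    "productB": [0.65, 0.47, 0.55, np.nan, np.nan],
    "productC": [0.78, np.nan, np.nan, np.nan, np.nan],
    "productD": [np.nan, np.nan, 0.25, np.nan, np.nan],
    "productE": [0.12, np.nan, 0.47, 0.12, np.nan],
}

names = np.array(list(d))                           # product names, in row order
matrix = np.array(list(d.values()))                 # rows = products, columns = IDs
filled = np.where(np.isnan(matrix), -np.inf, matrix)

# Rank products within each ID column, best first.
order = np.argsort(filled, axis=0)[::-1]

top = {}
for col in range(filled.shape[1]):
    ranked = order[:, col]
    # Drop the -inf placeholders, then keep at most three products.
    keep = ranked[np.isfinite(filled[ranked, col])][:3]
    top[col + 1] = names[keep].tolist()

for id_, products in top.items():
    print(f"ID: {id_} -> {products}")
```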