Find missing numbers in a sorted column in Pandas Dataframe
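I have the dataframe below and I want to detect the missing visits for each subject. How can I sort the visits by subject and extract only the records for the missing visits? Please check the two types of output required.
Part 1: based on the highest number in the Visit column across all subjects, the missing records for every subject need to be shown. Input: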
Subject Visit X1 X2
A 1 1647143 1672244
A 2 1672244 1689707
A 4 1689707 1713090
B 1 1735352 1760283
B 2 1760283 1788062
B 7 1788062 1789885
B 9 1789885 1790728
The output would be:
Subject Visit X1 X2
A 3 1647143 1672244
A 5 1672244 1689707
A 6 1689707 1713090
A 7 1647143 1672244
A 8 1672244 1689707
A 9 1689707 1713090
B 3 1735352 1760283
B 4 1760283 1788062
B 5 1788062 1789885
B 6 1789885 1790728
B 8 1789885 1790728
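Part 2: based on the highest number in the Visit column for each specific subject, the missing records in that subject's visit sequence need to be shown. Example output: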
Subject Visit X1 X2
A 3 1647143 1672244
B 3 1735352 1760283
B 4 1760283 1788062
B 5 1788062 1789885
B 6 1789885 1790728
B 8 1789885 1790728
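For reference, the input table can be rebuilt as a DataFrame roughly like this. The name `df` matches what the answers use; the dictionary-based construction is only one assumed way of loading the data.

```python
import pandas as pd

# Sample data copied from the input table in the question
df = pd.DataFrame({
    "Subject": ["A", "A", "A", "B", "B", "B", "B"],
    "Visit": [1, 2, 4, 1, 2, 7, 9],
    "X1": [1647143, 1672244, 1689707, 1735352, 1760283, 1788062, 1789885],
    "X2": [1672244, 1689707, 1713090, 1760283, 1788062, 1789885, 1790728],
})

print(df)
```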
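To find the missing visits for each subject when the upper bound for every subject is the maximum of the whole Visit column, you can build the set of all possible (subject, visit) pairs and take its difference with the observed pairs.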
import pandas as pd
from itertools import product
all_pairs = set(product(sorted(set(df.Subject)), range(1, df.Visit.max()+1)))
observed_pairs = set(tuple(x) for x in df[['Subject', 'Visit']].to_numpy())
# create a data frame from the missing pairs
pd.DataFrame(sorted(all_pairs.difference(observed_pairs)), columns=['Subject', 'Visit'])
# returns:
Subject Visit
0 A 3
1 A 5
2 A 6
3 A 7
4 A 8
5 A 9
6 B 3
7 B 4
8 B 5
9 B 6
10 B 8
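The same set difference can probably also be expressed with pandas index objects instead of Python sets. The sketch below assumes the sample data from the question and uses `pd.MultiIndex.from_product` for the full grid; the variable names `full`, `observed` and `missing` are illustrative only.

```python
import pandas as pd

# Sample data from the question (only the key columns are needed here)
df = pd.DataFrame({
    "Subject": ["A", "A", "A", "B", "B", "B", "B"],
    "Visit": [1, 2, 4, 1, 2, 7, 9],
})

# Every (Subject, Visit) pair up to the global maximum visit
full = pd.MultiIndex.from_product(
    [sorted(df["Subject"].unique()), range(1, df["Visit"].max() + 1)],
    names=["Subject", "Visit"],
)

# Pairs that actually occur in the data
observed = pd.MultiIndex.from_frame(df[["Subject", "Visit"]])

# Pairs in the full grid that were never observed
missing = pd.DataFrame(list(full.difference(observed)), columns=["Subject", "Visit"])
print(missing)
```

Keeping the result as a DataFrame makes it easy to merge other columns back in afterwards.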
To find the missing visits within each subject's own maximum visit instead, you can do the following:
def missing_visits(s):
    all_v = set(range(1, s.max()+1))
    obs_v = set(s)
    return sorted(all_v.difference(obs_v))
df.groupby('Subject')['Visit'].apply(missing_visits).explode()
# returns:
Subject
A 3
B 3
B 4
B 5
B 6
B 8
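If a flat DataFrame is preferred over the exploded Series, the result can be tidied up along these lines. This is a sketch on the question's sample data; the `dropna` step is an added assumption to cover subjects with no gaps, for which `explode` yields a NaN row.

```python
import pandas as pd

# Sample data from the question (only the key columns are needed here)
df = pd.DataFrame({
    "Subject": ["A", "A", "A", "B", "B", "B", "B"],
    "Visit": [1, 2, 4, 1, 2, 7, 9],
})

def missing_visits(s):
    # Visits between 1 and this subject's own maximum that were not observed
    return sorted(set(range(1, s.max() + 1)) - set(s))

missing = (
    df.groupby("Subject")["Visit"]
      .apply(missing_visits)   # one list of missing visits per subject
      .explode()               # one row per missing visit
      .dropna()                # subjects without gaps produce NaN after explode
      .astype(int)
      .reset_index()           # back to Subject / Visit columns
)
print(missing)
```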
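import numpy as np
# Reindex each group's row labels over np.arange(min Visit, max Visit), then ffill/bfill the new rows.
# This matches on the default integer index of df rather than on the Visit values themselves.
g = df.groupby('Subject', group_keys=False).apply(lambda x: x.reindex(np.arange(x['Visit'].min(), x['Visit'].max())).ffill().bfill())
# Overwrite the Visit column with the new index
g['Visit'] = g.index
print(g)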
# First outcome
Subject Visit X1 X2
1 A 1 1672244.0 1689707.0
2 A 2 1689707.0 1713090.0
3 A 3 1689707.0 1713090.0
1 B 1 1735352.0 1760283.0
2 B 2 1735352.0 1760283.0
3 B 3 1735352.0 1760283.0
4 B 4 1760283.0 1788062.0
5 B 5 1788062.0 1789885.0
6 B 6 1789885.0 1790728.0
7 B 7 1789885.0 1790728.0
8 B 8 1789885.0 1790728.0
#Filtered outcome
#Create and compare tuples of ['Subject','Visit'] of the original and new dataframes
g[~g[['Subject','Visit']].agg(tuple,1).isin(df[['Subject','Visit']].agg(tuple,1))]
Subject Visit X1 X2
3 A 3 1689707.0 1713090.0
3 B 3 1735352.0 1760283.0
4 B 4 1760283.0 1788062.0
5 B 5 1788062.0 1789885.0
6 B 6 1789885.0 1790728.0
8 B 8 1789885.0 1790728.0
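A variant that aligns on the Visit values themselves, rather than on row labels, could be sketched with a left merge against a per-subject grid. This is not the answer's method: the names `bounds`, `grid` and `merged` are illustrative, and forward-filling X1 and X2 from the previous observed visit is an assumption about how the gaps should be populated.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Subject": ["A", "A", "A", "B", "B", "B", "B"],
    "Visit": [1, 2, 4, 1, 2, 7, 9],
    "X1": [1647143, 1672244, 1689707, 1735352, 1760283, 1788062, 1789885],
    "X2": [1672244, 1689707, 1713090, 1760283, 1788062, 1789885, 1790728],
})

# Smallest and largest observed visit per subject
bounds = df.groupby("Subject")["Visit"].agg(["min", "max"])

# One row per (Subject, Visit) in each subject's own min..max range
grid = pd.DataFrame(
    [(subj, v) for subj, lo, hi in bounds.itertuples() for v in range(lo, hi + 1)],
    columns=["Subject", "Visit"],
)

# Left merge keeps every grid row; _merge marks rows absent from df
merged = grid.merge(df, on=["Subject", "Visit"], how="left", indicator=True)

# Assumed fill rule: carry X1/X2 forward from the previous observed visit
merged[["X1", "X2"]] = merged.groupby("Subject")[["X1", "X2"]].ffill()

missing = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")
print(missing)
```

Because the grid is keyed on Visit, this version does not depend on how the rows of `df` happen to be indexed.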
Here is a data.table option in R:
> setDT(df)[, .(Visit = setdiff(seq(max(df[, "Visit"])), Visit)), Subject]
Subject Visit
1: A 3
2: A 5
3: A 6
4: A 7
5: A 8
6: A 9
7: B 3
8: B 4
9: B 5
10: B 6
11: B 8
> setDT(df)[, .(Visit = setdiff(seq(max(Visit)), Visit)), Subject]
Subject Visit
1: A 3
2: B 3
3: B 4
4: B 5
5: B 6
6: B 8
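For comparison, the two `setdiff` calls above map fairly directly onto `np.setdiff1d` in Python. The function below is a rough sketch; its name and the `per_subject` flag are hypothetical and only switch between the global maximum (part 1) and each subject's own maximum (part 2).

```python
import numpy as np
import pandas as pd

# Sample data from the question (only the key columns are needed here)
df = pd.DataFrame({
    "Subject": ["A", "A", "A", "B", "B", "B", "B"],
    "Visit": [1, 2, 4, 1, 2, 7, 9],
})

def missing_visits(df, per_subject=True):
    """Return a DataFrame of (Subject, Visit) pairs absent from df."""
    global_max = df["Visit"].max()
    rows = []
    for subj, visits in df.groupby("Subject")["Visit"]:
        # Upper bound: this subject's own maximum, or the maximum over all subjects
        upper = visits.max() if per_subject else global_max
        # setdiff1d returns the sorted values of the first array not in the second
        for v in np.setdiff1d(np.arange(1, upper + 1), visits):
            rows.append((subj, int(v)))
    return pd.DataFrame(rows, columns=["Subject", "Visit"])

print(missing_visits(df, per_subject=False))  # part 1: global maximum
print(missing_visits(df, per_subject=True))   # part 2: per-subject maximum
```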