Python: get random data from a pandas DataFrame
I have a DataFrame df with the following values:
name algo accuracy
tom 1 88
tommy 2 87
mark 1 88
stuart 3 100
alex 2 99
lincoln 1 88
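How can I randomly select 4 records from df, with the condition that at least one record is picked for each unique value of the algo column? Here, the algo column has only 3 unique values (1, 2, 3).
Sample output: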
name algo accuracy
tom 1 88
tommy 2 87
stuart 3 100
lincoln 1 88
Sample output 2:
name algo accuracy
mark 1 88
stuart 3 100
alex 2 99
lincoln 1 88
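One way:
import pandas as pd

num_sample, num_algo = 4, 3
# sample the same number of rows from each algo (here 4 // 3 = 1)
out = df.groupby('algo').sample(n=num_sample // num_algo)
# fill the remaining slots from the rows that didn't get selected
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
out = pd.concat([out, df.drop(out.index).sample(n=num_sample - len(out))])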
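For reference, here is a self-contained sketch of the same idea wrapped in a function. The helper name `sample_covering_groups`, its `seed` parameter, and the range check are my own additions for illustration, not part of the original answer; it assumes pandas 1.1 or later, where `groupby(...).sample` is available.

```python
import pandas as pd

# The example data from the question
df = pd.DataFrame({
    "name": ["tom", "tommy", "mark", "stuart", "alex", "lincoln"],
    "algo": [1, 2, 1, 3, 2, 1],
    "accuracy": [88, 87, 88, 100, 99, 88],
})

def sample_covering_groups(df, col, n, seed=None):
    """Pick n random rows with at least one row per unique value of col.

    Hypothetical helper: the name and signature are illustrative only.
    """
    n_groups = df[col].nunique()
    if not n_groups <= n <= len(df):
        raise ValueError("n must be between the number of groups and len(df)")
    # One random row per group guarantees every value of col is covered
    base = df.groupby(col).sample(n=1, random_state=seed)
    # Fill the remaining slots from the rows that were not picked yet
    rest = df.drop(base.index).sample(n=n - n_groups, random_state=seed)
    return pd.concat([base, rest])

out = sample_covering_groups(df, "algo", 4, seed=0)
print(out)
```

Passing `seed` only makes the draw reproducible; leave it as `None` to get a different sample on each call.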
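Another way is to shuffle the whole data, enumerate the rows within each algo, sort by that enumeration, and take the required number of samples. This takes slightly more code than the first approach, but it is cheaper and produces more balanced algo counts: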
# shuffle the algo column (the index still identifies the original rows)
df_random = df['algo'].sample(frac=1)
# number the rows within each algo: 0, 1, 2, ...
enums = df_random.groupby(df_random).cumcount()
# sort by that number, so the first row of every algo comes first
enums = enums.sort_values()
# pick the first num_sample entries;
# their index labels are the labels of the sampled rows,
# so we can use `loc`
out = df.loc[enums.iloc[:num_sample].index]
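The shuffle-and-rank approach can also be sketched as a reusable function. The name `balanced_sample` and the `seed` parameter are assumptions of mine, and I use a stable sort so that ties keep their shuffled order, which the original snippet does not require.

```python
import pandas as pd

# The example data from the question
df = pd.DataFrame({
    "name": ["tom", "tommy", "mark", "stuart", "alex", "lincoln"],
    "algo": [1, 2, 1, 3, 2, 1],
    "accuracy": [88, 87, 88, 100, 99, 88],
})

def balanced_sample(df, col, n, seed=None):
    """Pick n random rows, spreading them as evenly as possible over col.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Shuffle the grouping column; index labels still point at df's rows
    shuffled = df[col].sample(frac=1, random_state=seed)
    # Rank of each row within its group, in shuffled order: 0, 1, 2, ...
    rank = shuffled.groupby(shuffled).cumcount()
    # All rank-0 rows (one per group) come first, then rank 1, and so on
    chosen = rank.sort_values(kind="stable").index[:n]
    return df.loc[chosen]

out = balanced_sample(df, "algo", 4, seed=0)
print(out)
```

Because every group contributes its rank-0 row before any group contributes a second one, each value of `algo` is covered whenever `n` is at least the number of unique values.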