Python: get random data from a pandas DataFrame

Question

I have a DataFrame df with these values:

name     algo      accuracy
tom       1         88
tommy     2         87
mark      1         88
stuart    3         100
alex      2         99
lincoln   1         88

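How can I randomly select 4 records from df, with the condition that at least one record is picked for each unique value of the algo column? Here the algo column has only 3 unique values (1, 2, 3).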

Sample output:

name     algo      accuracy
tom       1         88
tommy     2         87
stuart    3         100
lincoln   1         88

Sample output 2:

name     algo      accuracy
mark      1         88
stuart    3         100
alex      2         99
lincoln   1         88

Answer 1

One way:

import pandas as pd

num_sample, num_algo = 4, 3

# sample one row for each algo (4 // 3 == 1)
out = df.groupby('algo').sample(n=num_sample // num_algo)

# add the remaining samples from the rows that didn't get selected;
# DataFrame.append was removed in pandas 2.0, so use pd.concat
out = pd.concat([out, df.drop(out.index).sample(n=num_sample - len(out))])
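For anyone who wants to try this first approach end to end, here is a minimal self-contained sketch. The helper name `sample_covering` and its `seed` parameter are hypothetical (not part of pandas), and it assumes the index of df is unique and pandas 1.1 or later, where `groupby(...).sample` is available.

```python
import pandas as pd

# The question's data, rebuilt so the sketch runs on its own.
df = pd.DataFrame({
    "name": ["tom", "tommy", "mark", "stuart", "alex", "lincoln"],
    "algo": [1, 2, 1, 3, 2, 1],
    "accuracy": [88, 87, 88, 100, 99, 88],
})


def sample_covering(df, col, n, seed=None):
    """Return n random rows with at least one row per unique value of col.

    Hypothetical helper for illustration; assumes df has a unique index.
    """
    n_groups = df[col].nunique()
    if not n_groups <= n <= len(df):
        raise ValueError("n must be between the number of groups and len(df)")
    # One random row per group guarantees that every value is covered.
    base = df.groupby(col).sample(n=1, random_state=seed)
    # Fill the remaining slots from the rows that were not picked yet.
    rest = df.drop(base.index).sample(n=n - len(base), random_state=seed)
    return pd.concat([base, rest])


out = sample_covering(df, "algo", 4, seed=0)
print(out)
```

Taking exactly one row per group first, then topping up from the leftovers, keeps the coverage guarantee even when some group (here algo 3) has only a single row.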

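Another option is to shuffle the whole data, enumerate the rows within each algo, sort by that enumeration and take the required number of samples. This is slightly more code than the first approach, but it is cheaper and produces more balanced algo counts: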

# shuffle data (only the algo column is needed; the index points back to df)
df_random = df['algo'].sample(frac=1)

# enumerations of rows with the same algo: 0, 1, 2, ... within each algo
enums = df_random.groupby(df_random).cumcount()

# sort by the enumeration, so one row per algo comes first
enums = enums.sort_values()

# pick the first num_sample indices
# these will be indices of the samples
# so we can use `loc`
out = df.loc[enums.iloc[:num_sample].index]
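The shuffle-and-enumerate approach can be wrapped up as a runnable sketch as follows. The function name `sample_balanced` and the `seed` parameter are hypothetical, and the sketch assumes df has a unique index and that n is at least the number of unique values in the column.

```python
import pandas as pd

# The question's data, rebuilt so the sketch runs on its own.
df = pd.DataFrame({
    "name": ["tom", "tommy", "mark", "stuart", "alex", "lincoln"],
    "algo": [1, 2, 1, 3, 2, 1],
    "accuracy": [88, 87, 88, 100, 99, 88],
})


def sample_balanced(df, col, n, seed=None):
    """Return n random rows, spread as evenly as possible over values of col.

    Hypothetical helper for illustration; assumes df has a unique index.
    """
    # Shuffle only the grouping column; its index still points back to df.
    shuffled = df[col].sample(frac=1, random_state=seed)
    # Position of each row within its group, in shuffled order: 0, 1, 2, ...
    rank = shuffled.groupby(shuffled).cumcount()
    # All rank-0 rows (one per group) sort first, then rank 1, and so on.
    picked = rank.sort_values(kind="mergesort").index[:n]
    return df.loc[picked]


out = sample_balanced(df, "algo", 4, seed=0)
print(out)
```

Because every group contributes its rank-0 row before any group contributes a second one, the first n rows cover all groups whenever n is at least the number of groups, and no group gets far ahead of the others.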
