Sampling in pandas
If I want to randomly sample a pandas DataFrame, I can use pandas.DataFrame.sample.
Suppose I randomly sample 80% of the rows. How do I automatically get the other 20% of the rows that were not picked?
As Lagerbaer explains, one can add a column with a unique index to the dataframe, or randomly shuffle the entire dataframe. For the latter,
df.reindex(np.random.permutation(df.index))
works (np means numpy). After shuffling, the first 80% of the rows can serve as the sample and the remaining 20% are the rows that were not picked.
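A minimal sketch of the shuffle-then-slice idea, under a few assumptions: the example frame, the seed, and the names `picked` and `rest` are made up for illustration. It swaps in a positional permutation with `iloc` instead of `reindex`, so it does not depend on the index labels being unique.

```python
import numpy as np
import pandas as pd

# Illustrative data; any DataFrame would do.
df = pd.DataFrame({"a": range(1, 11), "b": range(11, 21)})

# Seeded generator so the split is reproducible (the seed value is arbitrary).
rng = np.random.default_rng(0)

# Random ordering of row positions 0..len(df)-1, then reorder the rows.
order = rng.permutation(len(df))
shuffled = df.iloc[order]

# The first 80% of the shuffled rows are the sample, the rest is the complement.
cut = int(0.8 * len(shuffled))
picked = shuffled.iloc[:cut]
rest = shuffled.iloc[cut:]
```

Because both parts are slices of the same shuffled frame, they cannot overlap and together cover every row.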
>>> import pandas as pd, numpy as np
>>> df = pd.DataFrame({'a': [1,2,3,4,5,6,7,8,9,10], 'b': [11,12,13,14,15,16,17,18,19,20]})
>>> df
    a   b
0   1  11
1   2  12
2   3  13
3   4  14
4   5  15
5   6  16
6   7  17
7   8  18
8   9  19
9  10  20
# randomly sample 5 rows
>>> sample = df.sample(5)
>>> sample
   a   b
7  8  18
2  3  13
4  5  15
0  1  11
3  4  14
# list comprehension to get indices not in sample's indices
>>> idxs_not_in_sample = [idx for idx in df.index if idx not in sample.index]
>>> idxs_not_in_sample
[1, 5, 6, 8, 9]
# locate the rows at the indices in the original dataframe that aren't in the sample
>>> not_sample = df.loc[idxs_not_in_sample]
>>> not_sample
    a   b
1   2  12
5   6  16
6   7  17
8   9  19
9  10  20
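The same complement can probably be obtained without the list comprehension. The sketch below shows two common variants, `DataFrame.drop` and a boolean mask built with `Index.isin`. It assumes the index labels are unique, and the `random_state` value is only there to make the example reproducible.

```python
import pandas as pd

# Illustrative data matching the session above.
df = pd.DataFrame({"a": range(1, 11), "b": range(11, 21)})

# Sample 80% of the rows; random_state is arbitrary and only fixes the result.
sample = df.sample(frac=0.8, random_state=0)

# Variant 1: drop the sampled labels (assumes unique index labels).
not_sample = df.drop(sample.index)

# Variant 2: keep the rows whose label is not among the sampled labels.
not_sample_mask = df[~df.index.isin(sample.index)]
```

Both variants keep the remaining rows in their original order. If the index had duplicate labels, either approach would also remove unsampled rows that share a label with a sampled one.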