
Sample dataframe by value in column and keep all rows

I want to sample a Pandas dataframe using the values in a certain column, but I want to keep all rows whose value in that column is in the sample.

For example, in the dataframe below I want to randomly sample some fraction of the unique values in b, but keep all corresponding rows in a and c.

import pandas as pd
d = pd.DataFrame({'a': range(1, 101), 'b': list(range(0, 100, 4)) * 4, 'c': list(range(0, 100, 2)) * 2})

Desired example output from a 16% sample:

Out[66]: 
     a   b   c
0    1   0   0
1   26   0  50
2   51   0   0
3   76   0  50
4    4  12   6
5   29  12  56
6   54  12   6
7   79  12  56
8   18  68  34
9   43  68  84
10  68  68  34
11  93  68  84
12  19  72  36
13  44  72  86
14  69  72  36
15  94  72  86

I've tried sampling the series and merging back to the main data, like this:

In [66]: pd.merge(d, d.b.sample(int(.16 * d.b.nunique())))

This creates the desired output, but it seems inefficient. My real dataset has millions of distinct values in b and hundreds of millions of rows. I know I could also use some version of isin, but that is slow too.

Is there a more efficient way to do this?

I really doubt that isin is slow:

import numpy as np

# df is the dataframe (called d in the question)
uniques = df.b.unique()

# this may be the bottleneck
samples = np.random.choice(uniques, replace=False, size=int(0.16 * len(uniques)))

# keep every row whose b value was sampled
df[df.b.isin(samples)]
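
For reference, here is the same idea wrapped up so it runs on the question's example frame. This is only a sketch: the function name sample_by_column and the seed argument are made up for illustration, and it uses NumPy's Generator API in place of np.random.choice.

```python
import numpy as np
import pandas as pd


def sample_by_column(df, column, frac, seed=None):
    """Keep all rows whose `column` value is in a random subset of its unique values.

    Hypothetical helper, not part of pandas.
    """
    rng = np.random.default_rng(seed)
    uniques = df[column].unique()
    # Exact number of unique values to keep (rounded down).
    n_keep = int(frac * len(uniques))
    chosen = rng.choice(uniques, size=n_keep, replace=False)
    # Boolean mask over the rows; no merge needed.
    return df[df[column].isin(chosen)]


d = pd.DataFrame({
    'a': range(1, 101),
    'b': list(range(0, 100, 4)) * 4,
    'c': list(range(0, 100, 2)) * 2,
})

out = sample_by_column(d, 'b', 0.16, seed=0)
print(out.sort_values(['b', 'a']))
```

On this frame b has 25 unique values, so a 16% sample keeps 4 of them, and each one appears in 4 rows, which gives the 16 rows of the desired output. Which 4 values you get depends on the seed.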

You can profile the steps above. In case samples=... is slow, you can try:

# keep each unique value independently with probability 0.16 (roughly 16% overall)
idx = np.random.rand(len(uniques))
samples = uniques[idx < 0.16]

Both versions took about 100 ms on my system with 10 million rows.
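
One thing to keep in mind with the random-mask version: each unique value is kept independently, so the number of sampled values is only about 16% of the uniques and changes from run to run. A small sketch on the question's frame, assuming a seeded NumPy Generator (my choice, so the run is repeatable):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only for repeatability

d = pd.DataFrame({
    'a': range(1, 101),
    'b': list(range(0, 100, 4)) * 4,
    'c': list(range(0, 100, 2)) * 2,
})

uniques = d['b'].unique()

# One uniform draw per unique value; keep the value if the draw is below 0.16.
mask = rng.random(len(uniques)) < 0.16
samples = uniques[mask]

out = d[d['b'].isin(samples)]
print(len(samples), "unique values kept,", len(out), "rows kept")
```

With millions of unique values the kept fraction will be very close to 16%, so this is usually fine when you don't need an exact count.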

Note: d.b.sample(int(.16 * d.b.nunique())) does not sample 0.16 of the unique values in b. It draws rows from the column, so the same value can be picked more than once.
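
To see the difference, compare sampling the column directly with sampling its de-duplicated values. This is a sketch on the question's frame; the variable names and the random_state value are just for illustration.

```python
import pandas as pd

d = pd.DataFrame({
    'a': range(1, 101),
    'b': list(range(0, 100, 4)) * 4,
    'c': list(range(0, 100, 2)) * 2,
})

n = int(0.16 * d['b'].nunique())  # 4 on this frame

# Samples n ROWS of the column: the same b value may be drawn repeatedly.
row_sample = d['b'].sample(n, random_state=1)

# Samples n DISTINCT values: de-duplicate first, then sample.
unique_sample = d['b'].drop_duplicates().sample(n, random_state=1)

merged_rows = pd.merge(d, row_sample.to_frame())
merged_unique = pd.merge(d, unique_sample.to_frame())

print("row sample, distinct b values:   ", row_sample.nunique())
print("unique sample, distinct b values:", unique_sample.nunique())
```

If the row sample happens to contain a repeated value, the merge returns the matching rows once per repeat, so you get duplicated rows and fewer distinct b values than intended. Sampling from the de-duplicated values avoids that.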
