
Pandas dataframe vectorized sampling

I have a simple DataFrame from which I build a pivot table:

    import pandas as pd

    d = {'one': ['A', 'B', 'B', 'C', 'C', 'C'], 'two': [6., 5., 4., 3., 2., 1.],
         'three': [6., 5., 4., 3., 2., 1.], 'four': [6., 5., 4., 3., 2., 1.]}
    df = pd.DataFrame(d)
    pivot = pd.pivot_table(df, index=['one', 'two'])

I would like to randomly sample one row for each distinct value of 'one' in the resulting pivot object. (In this example, 'A' will always be sampled, while there are several candidates for 'B' and 'C'.) I just began using pandas 0.18.0 and am aware of the .sample method. I experimented with the .groupby method, applying a sampling function something like this:

    grouped = pivot.groupby('one').apply(lambda x: x.sample(n=1, replace=False))

I get a KeyError when I try variations on that theme, so I thought it was time for a fresh perspective on this seemingly simple question.

Thanks for any assistance!

The KeyError is raised because 'one' is not a column of pivot but the name of an index level:

In [11]: pivot
Out[11]:
         four  three
one two
A   6.0   6.0    6.0
B   4.0   4.0    4.0
    5.0   5.0    5.0
C   1.0   1.0    1.0
    2.0   2.0    2.0
    3.0   3.0    3.0
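A quick way to confirm this is to inspect the index names and the columns directly. The sketch below rebuilds the frame from the question; the variable names index_names and is_column are only illustrative. Note that newer pandas releases can resolve a string passed to groupby against index level names as well, so the KeyError may not reproduce outside older versions such as 0.18.

```python
import pandas as pd

d = {
    'one': ['A', 'B', 'B', 'C', 'C', 'C'],
    'two': [6., 5., 4., 3., 2., 1.],
    'three': [6., 5., 4., 3., 2., 1.],
    'four': [6., 5., 4., 3., 2., 1.],
}
df = pd.DataFrame(d)
pivot = pd.pivot_table(df, index=['one', 'two'])

# 'one' and 'two' became levels of the row MultiIndex ...
index_names = list(pivot.index.names)

# ... so 'one' is no longer among the columns.
is_column = 'one' in pivot.columns

print(index_names)
print(is_column)
```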

You have to use the level argument:

In [12]: pivot.groupby(level='one').apply(lambda x: x.sample(n=1, replace=False))
Out[12]:
             four  three
one one two
A   A   6.0   6.0    6.0
B   B   4.0   4.0    4.0
C   C   1.0   1.0    1.0

This isn't quite right, since the 'one' index level is repeated! It's slightly better with as_index=False:

In [13]: pivot.groupby(level='one', as_index=False).apply(lambda x: x.sample(n=1))
Out[13]:
           four  three
  one two
0 A   6.0   6.0    6.0
1 B   4.0   4.0    4.0
2 C   2.0   2.0    2.0
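If the goal is to keep the original (one, two) index without either the repeated level or the extra integer level, one option that may work is group_keys=False, which asks groupby not to prepend the group key to the result of apply. This is a sketch rather than part of the answer above, and the name sampled is only illustrative.

```python
import pandas as pd

d = {
    'one': ['A', 'B', 'B', 'C', 'C', 'C'],
    'two': [6., 5., 4., 3., 2., 1.],
    'three': [6., 5., 4., 3., 2., 1.],
    'four': [6., 5., 4., 3., 2., 1.],
}
df = pd.DataFrame(d)
pivot = pd.pivot_table(df, index=['one', 'two'])

# group_keys=False: the sampled rows keep their original two-level index,
# with no extra copy of 'one' and no integer group counter in front.
sampled = pivot.groupby(level='one', group_keys=False).apply(
    lambda x: x.sample(n=1)
)

print(sampled)
```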

Note: This picks a random row each time, so the output differs between runs.
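When a repeatable draw is needed (for tests, say), a seed can be passed through random_state. The sketch below assumes a pandas version that provides sample directly on the groupby object (added in 1.1, as far as I know); on older versions the same random_state argument can be passed to x.sample inside the apply call instead.

```python
import pandas as pd

d = {
    'one': ['A', 'B', 'B', 'C', 'C', 'C'],
    'two': [6., 5., 4., 3., 2., 1.],
    'three': [6., 5., 4., 3., 2., 1.],
    'four': [6., 5., 4., 3., 2., 1.],
}
df = pd.DataFrame(d)
pivot = pd.pivot_table(df, index=['one', 'two'])

# Assumes pandas >= 1.1: sample one row per group, seeded for repeatability.
first = pivot.groupby(level='one').sample(n=1, random_state=42)
second = pivot.groupby(level='one').sample(n=1, random_state=42)

# The same seed yields the same rows on both calls.
print(first.equals(second))
```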


As an alternative, a potentially more performant variant pulls a subframe straight out of the original df (g is not defined in the snippet; judging by the output it is presumably df.groupby('one')):

In [21]: df.iloc[[np.random.choice(x) for x in g.indices.values()]]
Out[21]:
   four one  three  two
1   5.0   B    5.0  5.0
3   3.0   C    3.0  3.0
0   6.0   A    6.0  6.0
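For completeness, here is a self-contained version of that last snippet. It assumes g = df.groupby('one'), which the transcript does not show, and uses a seeded NumPy Generator in place of np.random.choice so the draw can be repeated; both choices are mine, not the answer's.

```python
import numpy as np
import pandas as pd

d = {
    'one': ['A', 'B', 'B', 'C', 'C', 'C'],
    'two': [6., 5., 4., 3., 2., 1.],
    'three': [6., 5., 4., 3., 2., 1.],
    'four': [6., 5., 4., 3., 2., 1.],
}
df = pd.DataFrame(d)

# Assumed definition of g (not shown in the original transcript).
g = df.groupby('one')

# g.indices maps each group key to an array of integer row positions.
rng = np.random.default_rng(0)
positions = [rng.choice(pos) for pos in g.indices.values()]

# Pull the chosen rows out of the original frame by position.
subframe = df.iloc[positions]
print(subframe)
```

This skips apply entirely: only one integer is drawn per group, and the rows are fetched in a single iloc call.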


 