
Numpy random choice multiple loops

I am trying to perform multiple simulations many times to get a distribution of the simulated values. I have a dataset that looks like the one below.

fruit_type, reading, prob
Apple, 12, .05
apple, 15, .5
orange, 18, .99

An example of my code is below.

def sim(seconds):
    output = pd.DataFrame()
    current = []
    #output = pd.DataFrame()
    for i in range(1, 100000000):
        if data2['fruit_type'].all() == 'Apple':
            hostrecord1 = np.random.choice(data2['reading'], size=23, replace=True, p=data2['prob'])
            current = hostrecord1.sum() + 150

        if data2['fruit_type'].all() == 'Orange':
            hostrecord2 = np.random.choice(data2['reading'], size=23, replace=True, p=data2['prob'])
            current = hostrecord2.sum() + 150

        if data2['fruit_type'].all() == 'Peach':
            hostrecord3 = np.random.choice(data2['reading'], size=20, replace=True, p=data2['prob'])
            current = hostrecord3.sum() + 150

    #put all records in one array
    #return all records 
    output = pd.concat(current)
    return output

I am trying to figure out how to perform multiple simulations with conditions that vary by fruit_type, but currently can't figure out the logic. Each simulation should only select the rows for its fruit_type, so the simulations are split up by fruit_type. The size of each sample is different by design, as each fruit_type has different conditions.

My expected output is an array of all the simulation values. I also want to append all the results into one pandas dataframe.

Your explanation is pretty unclear, but here's a guess:

# initialize toy data (note: np.vstack casts everything to strings,
# so convert the numeric columns back to floats afterwards)
In [1]: fruits = ['apple', 'peach', 'orange']
In [2]: data = np.vstack((np.random.choice(fruits, size=10), 
                          np.random.randint(0, 100, size=10), 
                          np.random.rand(10))).T
In [3]: df = pd.DataFrame(data, columns=['fruit_type', 'reading', 'prob'])
In [4]: df[['reading', 'prob']] = df[['reading', 'prob']].astype(float)

The key is indexing df so that df[df.fruit_type == fruit_of_interest] selects only the rows for that fruit. Here is a sample function:

def simulate(df, N_trials):
    # replace with actual sizes for ['apple', 'peach', 'orange'] respectively
    sample_sizes = [N1, N2, N3]
    fruits = ['apple', 'peach', 'orange']

    results = np.empty((N_trials, len(fruits)))
    for i in xrange(N_trials): # switch to range if using python3
        for j, (fruit, size) in enumerate(zip(fruits, sample_sizes)):
            sim_data = df[df.fruit_type == fruit]
            record = np.random.choice(sim_data.reading, size=size, p=sim_data.prob)
            # do something with the record
            results[i, j] = record.sum()

    return results
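
For reference, a minimal call might look like the following. The names here are purely illustrative, it assumes the N1/N2/N3 placeholders have been replaced with real sizes, and the pd.DataFrame step is one way to get "all the results into one pandas dataframe" as the question asks:

results = simulate(df, N_trials=10000)
results_df = pd.DataFrame(results, columns=['apple', 'peach', 'orange'])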

Note that the results array may be too big to fit in memory if you're doing 100 million trials. It may also be faster if you swap the for loops so the fruit/size one is the outermost for loop.
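
Here is a rough sketch of that loop swap, under assumptions not in the original: the sample sizes 23/20/23 are taken from the question's code, and prob is normalized per fruit because np.random.choice requires the probabilities to sum to 1.

def simulate_swapped(df, N_trials):
    # hypothetical variant with the fruit/size loop outermost
    fruits = ['apple', 'peach', 'orange']
    sample_sizes = [23, 20, 23]   # from the question: 23 apples, 20 peaches, 23 oranges

    results = np.empty((N_trials, len(fruits)))
    for j, (fruit, size) in enumerate(zip(fruits, sample_sizes)):
        # the boolean indexing is now done once per fruit instead of once per trial
        sim_data = df[df.fruit_type == fruit]
        p = (sim_data.prob / sim_data.prob.sum()).values
        for i in range(N_trials):
            results[i, j] = np.random.choice(sim_data.reading, size=size, p=p).sum()
    return results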


It's also worth noting that instead of for-looping, you could always generate a huge sample with np.random.choice and then reshape:

np.random.choice([0, 1], size=1000000).reshape(10000, 100)

would give you 10000 trials with 100 samples each. This could be useful if your 100 million trials are taking too long -- you could split that into 100 loops, with choice doing 1 million samples at once. An example could be:

def simulate(df, N_trials, chunk_size=10000):
    # replace with actual sizes for ['apple', 'peach', 'orange'] respectively
    sample_sizes = [N1, N2, N3]
    fruits = ['apple', 'peach', 'orange']

    for i in xrange(N_trials // chunk_size): # switch to range if using python3
        chunk_results = np.empty((chunk_size, len(fruits)))
        for j, (fruit, size) in enumerate(zip(fruits, sample_sizes)):
            sim_data = df[df.fruit_type == fruit]
            record = np.random.choice(sim_data.reading, size=(chunk_size, size), 
                                      p=sim_data.prob)
            chunk_results[:, j] = record.sum(axis=1)

        # do something intermediate with this chunk
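
Since the question also wants everything appended into one pandas dataframe, one hypothetical way to fill in the "do something intermediate with this chunk" step is to collect each chunk in a list and concatenate at the end. A minimal sketch under the same assumptions as above (illustrative sample sizes from the question's code, prob normalized per fruit so it sums to 1):

import numpy as np
import pandas as pd

def simulate_to_df(df, N_trials, chunk_size=10000):
    # hypothetical wrapper that returns every simulated sum as one DataFrame
    fruits = ['apple', 'peach', 'orange']
    sample_sizes = [23, 20, 23]   # from the question: 23 apples, 20 peaches, 23 oranges

    chunks = []
    for _ in range(N_trials // chunk_size):
        chunk_results = np.empty((chunk_size, len(fruits)))
        for j, (fruit, size) in enumerate(zip(fruits, sample_sizes)):
            sim_data = df[df.fruit_type == fruit]
            p = (sim_data.prob / sim_data.prob.sum()).values
            record = np.random.choice(sim_data.reading, size=(chunk_size, size), p=p)
            chunk_results[:, j] = record.sum(axis=1)
        chunks.append(pd.DataFrame(chunk_results, columns=fruits))

    return pd.concat(chunks, ignore_index=True)

Each row of the returned DataFrame is one trial and each column is one fruit_type, which matches the "array of all the simulation values" the question describes.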
