How do you divide up a list into chunks which vary according to a normal distribution

I want to take a list of thousands of items and group them into 12 chunks, where the number of items in each chunk follows a normal distribution (bell curve) and no item appears in more than one chunk. The list must be fully exhausted.

Input data looks like this:

['6355ab76f70c5c59749f2018',
 '6355c797f70c5c5974a1cb15',
 '6355d256f70c5c5974a36a6c',
 '6355d270f70c5c5974a37356',
 '6355d29bf70c5c5974a3810a',
 '6355d300f70c5c5974a3a202',
 '6355d31af70c5c5974a3ab03',
 '6355d36cf70c5c5974a3c103',
 '6355d371f70c5c5974a3c236',
 '6355d389f70c5c5974a3c828',
 '6355d94df70c5c5974a55450',
 '6355d956f70c5c5974a556c1',
 '6355d987f70c5c5974a5626d',
 '6355d99df70c5c5974a566d9',
 '6355d9b1f70c5c5974a56b5c',
 '6355d9bbf70c5c5974a56d50',
 '6355d9d3f70c5c5974a572e1',
 '6355d9fdf70c5c5974a57c53',
 '6355da0cf70c5c5974a57f8f',
 '6355da11f70c5c5974a58065',
 '6355da19f70c5c5974a58261',
 '6355da68f70c5c5974a592ca',
 '6355da6cf70c5c5974a593ab',
 '6355da80f70c5c5974a597de',
 '6355da8af70c5c5974a599fa',
 '6355da93f70c5c5974a59c09',
 '6355da98f70c5c5974a59d20',
 '6355daa1f70c5c5974a59ec9',
 '6355daa7f70c5c5974a59fec',
 '6355dac5f70c5c5974a5a6dd',
 '6355dadaf70c5c5974a5ab75',
 '6355dafcf70c5c5974a5b2dc',
 '6355db6df70c5c5974a5d24b',
 '6355dba0f70c5c5974a5dfea',
 '6355dc16f70c5c5974a5fe14',
 '6355dc31f70c5c5974a6059d',
 '6355dc37f70c5c5974a60782',
 '6355dc3cf70c5c5974a608eb',
 '6355dc41f70c5c5974a60a99',
 '6355dc47f70c5c5974a60bb9',
 '6355dc5cf70c5c5974a611ef',
 '6355dc67f70c5c5974a61578',
 '6355dcaaf70c5c5974a62831',
 '6355dcb4f70c5c5974a62b2c',
 '6355dcbff70c5c5974a62e73',
 '6355dcc8f70c5c5974a63113',
 '6355dcd7f70c5c5974a6355c',
 '6355dcf3f70c5c5974a63c91',
 '6355dcf7f70c5c5974a63de9',
 '6355dd04f70c5c5974a64144',
 '6355dd0ef70c5c5974a64438',
 '6355dd53f70c5c5974a65902',
 '6355dd61f70c5c5974a65cf6',
 '6355dd6bf70c5c5974a66010',
 '6355dd70f70c5c5974a66195',
 '6355dd74f70c5c5974a662f9',
 '6355dd98f70c5c5974a66d4e',
 '6355dd9df70c5c5974a66e99',
 '6355dda2f70c5c5974a66fbd',
 '6355ddb0f70c5c5974a673e4',
 '6355ddbaf70c5c5974a67638',
 '6355ddc5f70c5c5974a6796b',
 '6355ddcef70c5c5974a67bcf',
 '6355de01f70c5c5974a6892c',
 '6355de15f70c5c5974a68ecf',
 '6355de1bf70c5c5974a69023',
 '6355de3df70c5c5974a699ad',
 '6355de58f70c5c5974a6a1ab',
 '6355de62f70c5c5974a6a4df',
 '6355de6bf70c5c5974a6a787',
 '6355de9cf70c5c5974a6b5a8',
 '6355dea0f70c5c5974a6b6ed',
 '6355deccf70c5c5974a6c3dc',
 '6355ded4f70c5c5974a6c602',
 '6355dee8f70c5c5974a6cbd2',
 '6355e8f1f70c5c5974a9db18',
 '6355e924f70c5c5974a9ec85',
 '6355e9dbf70c5c5974aa2b37',
 '6355eaaef70c5c5974aa7348',
 '6355ead5f70c5c5974aa81ac',
 '6355ec02f70c5c5974aaefaa',
 '6355ec64f70c5c5974ab135d',
 '6355ec8df70c5c5974ab2157',
 '6355ecb2f70c5c5974ab2ce7',
 '6355eccaf70c5c5974ab346f',
 '6355eccff70c5c5974ab3691',
 '6355ecd3f70c5c5974ab376b',
 '6355ece2f70c5c5974ab3ba0',
 '6355eceef70c5c5974ab3efb',
 '6355ecfef70c5c5974ab4384',
 '6355ed03f70c5c5974ab44c3',
 '6355ed24f70c5c5974ab4f4f',
 '6355ed4cf70c5c5974ab5b39',
 '6355ed78f70c5c5974ab6840',
 '6355ed9ff70c5c5974ab7388',
 '6355edb1f70c5c5974ab7888',
 '6355edb3f70c5c5974ab790b']

What the output should look like:

I am looking for output like this: a list of objects, each with a numerical key from 0 to 11 and the chunked list items as the value:

[
    { 0: ['6355ab76f70c5c59749f2018', '6355c797f70c5c5974a1cb15', '6355d256f70c5c5974a36a6c' ] },
    { 1: ['6355d270f70c5c5974a37356',
 '6355d29bf70c5c5974a3810a',
 '6355d300f70c5c5974a3a202',
 '6355d31af70c5c5974a3ab03',
 '6355d36cf70c5c5974a3c103',
 '6355d371f70c5c5974a3c236',
 '6355d389f70c5c5974a3c828'] },
    ...
]

The output chunk sizes should follow the same gradient as this image, symmetric on both sides and larger near the center, for a list of any size n:

[image]

It should lump the input list into chunks that are symmetric on both sides, with the number of items per chunk increasing smoothly toward the center of the output list.

I want the list I pass in to be divided so that most items are grouped in the middle (roughly chunks 4-8) and fewer items are grouped together toward the "edges" of the resulting list (chunks 0-3 and 9-11). The whole input list must be used up, so that every item is distributed in this way.

I tried to tackle this with numpy, but so far I have not been able to get the output I want.

My current code (two different functions):

        
import numpy as np


def divide_list_normal(lst):
    normal_dist = np.random.normal(size=len(lst)) # Generate a normal distribution of numbers
    sorted_list = [x for _,x in sorted(zip(normal_dist,lst))] # Sort the list according to the normal distribution
    chunk_size = int(len(lst)/len(normal_dist)) # Divide the list into chunks
    chunks = [sorted_list[i:i+chunk_size] for i in range(0, len(sorted_list), chunk_size)]
    return chunks 

def divide_list_normal_define_chunk_size(lst, n):
    normal_dist = np.random.normal(size=len(lst)) # Generate a normal distribution of numbers
    sorted_list = [x for _,x in sorted(zip(normal_dist,lst))] # Sort the list according to the normal distribution
    chunk_size = int(len(lst)/len(normal_dist)) # Divide the list into chunks
    chunks = [sorted_list[i:i+chunk_size] for i in range(0, n, chunk_size)]
    return chunks

The output of the first function looks like this:

[['63a8d83336756fd65d455c77'],
 ['6355f7c6f70c5c5974adfbce'],
 ['635629c6f70c5c5974bbab53'],
 ['6355fa8bf70c5c5974aeb70f'],
 ['6355dcd7f70c5c5974a6355c'],
 ['63a96dae36756fd65d549333'],
 ['639245927eeb4e9fd025e397'],
 ['63562463f70c5c5974ba3b5c'],
 ['63a8e04736756fd65d4635cf'],
 ['635629a5f70c5c5974bba1c1'],
 ['6355f74ef70c5c5974addd2c'],...]

The output of the second function looks like this:

[['63aa1a9d36756fd65d7566cf'],
 ['6355ed78f70c5c5974ab6840'],
 ['63a94e1836756fd65d500d5d'],
 ['63a8e23e36756fd65d4667ec'],
 ['63a96c6536756fd65d5463db'],
 ['63d39021d34efb9c0983d64a'],
 ['635627a9f70c5c5974bb1573'],
 ['63b3a4c236756fd65d33750a'],
 ['63562320f70c5c5974b9e50b'],
 ['63aa1aec36756fd65d758676'],
 ['63a9551636756fd65d5111fb'],
 ['63562443f70c5c5974ba31ed']]

Is there a way to divide up a list into chunks which vary according to a normal distribution? If you know how, please share it. Thank you!

This works, although it may be slow depending on your requirements:

import numpy as np
from itertools import islice


testList = ['6355d29bf70c5c5974a3810a',
 '6355d300f70c5c5974a3a202',
 '6355d31af70c5c5974a3ab03',
 '6355d36cf70c5c5974a3c103',
 '6355d300f70c5c5974a3a202',
 '6355d31af70c5c5974a3ab03',
 '6355d36cf70c5c5974a3c103',
 '6355d300f70c5c5974a3a202',
 '6355d31af70c5c5974a3ab03',
 '6355d36cf70c5c5974a3c103',
 '6355d300f70c5c5974a3a202',
 '6355d31af70c5c5974a3ab03',
 '6355d36cf70c5c5974a3c103',
 '6355d371f70c5c5974a3c236',
 '6355d389f70c5c5974a3c828']

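# Draw one normally distributed slice length per item (mean 10, std 4)
normal_dist = np.random.normal(size=len(testList), loc=10, scale=4)
# Take the first int(x) items of testList for each sampled length x
sorted_list = [list(islice(testList, int(x))) for x in normal_dist]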

One thing you have to watch out for: since these are slices of a list, the values drawn from the normal distribution can't be out of bounds, i.e. 0 < loc - scale < len(testList)
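A minimal sketch of one way to enforce those bounds, assuming it is acceptable to clip the sampled lengths instead of tuning loc and scale. The names test_list, rng, lengths, safe_lengths and slices are illustrative and not from the answer above. Note that every slice still starts at the beginning of the list, so the slices overlap; the sketch only addresses the bounds issue.

```python
import numpy as np
from itertools import islice

test_list = [f"id{i}" for i in range(15)]  # hypothetical stand-in data

rng = np.random.default_rng(0)  # seeded only so the run is repeatable
lengths = rng.normal(loc=10, scale=4, size=len(test_list))

# Truncate to integers and force every length into [0, len(test_list)];
# islice raises ValueError when given a negative stop value.
safe_lengths = np.clip(lengths.astype(int), 0, len(test_list))

# Each slice is a prefix of test_list with a clipped, normally distributed length.
slices = [list(islice(test_list, int(n))) for n in safe_lengths]
```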

For each index i, take the CDF at i + 0.5 and subtract the CDF at i - 0.5. That is the fraction of the list you should put in that index. For the first index, use just the CDF at i + 0.5 without subtracting the CDF at i - 0.5; for the last index, subtract the CDF at i - 0.5 from 1 rather than from the CDF at i + 0.5. You'll want the mean to be the middle of your indices, and you choose the standard deviation according to how spread out you want the sizes to be (you'll probably want it somewhere around one fourth of the number of indices, but it's up to you).
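The CDF approach described above could be sketched as follows, using statistics.NormalDist from the standard library (Python 3.8+). The function name normal_chunks and its parameters are illustrative, and rounding the cumulative boundaries is one possible way (an assumption, not part of the answer) to make the integer chunk sizes add up to the list length.

```python
from statistics import NormalDist


def normal_chunks(items, n_chunks=12, sigma=None):
    """Split items into n_chunks consecutive chunks with bell-shaped sizes."""
    if sigma is None:
        sigma = n_chunks / 4  # the rule of thumb suggested above
    # Center the distribution on the middle of the indices 0..n_chunks-1.
    dist = NormalDist(mu=(n_chunks - 1) / 2, sigma=sigma)

    # Cumulative fraction of the list that belongs to chunks 0..i.
    # The last boundary is forced to 1 so the end chunks absorb the tails.
    cum = [dist.cdf(i + 0.5) for i in range(n_chunks - 1)] + [1.0]

    # Rounding the cumulative boundaries (not the individual sizes) keeps
    # the chunk sizes summing to exactly len(items).
    bounds = [0] + [round(c * len(items)) for c in cum]

    # Same shape as the requested output: a list of {index: chunk} dicts.
    return [{i: items[bounds[i]:bounds[i + 1]]} for i in range(n_chunks)]


chunks = normal_chunks(list(range(1000)))
sizes = [len(chunk[i]) for i, chunk in enumerate(chunks)]
```

Because the chunks are consecutive slices, no item appears twice and the input is fully used. A smaller sigma concentrates more items in the middle chunks; with very short lists some edge chunks may come out empty.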
