Dataframe apply balanced allocation of rows to lists based on row type

I have a dataframe with 192 rows; each row represents a sentence with some metadata and a specific type (H or L). So, for each type I have a total of 96 sentences. I need to allocate them to 10 different lists under the following conditions:

  1. Each list has 6 sub-lists of 16 sentences each
  2. Each sub-list has an equal number of sentences of each type (8 H and 8 L)
  3. Each list has exactly one sentence from each of the 96 sets (so one sentence per set and a total of 96 sentences per list: 48 H and 48 L)
  4. Across the 10 lists, each set contributes 5 sentences of type H and 5 sentences of type L. In other words, the two types are allocated equally often for every set.

    So, taking set #1 as an example: the sentence "I went to work" will appear in 5 lists (anywhere in the list; the sub-list doesn't matter at this point), and the sentence "She went to work" will appear in the other 5 lists.

  5. The allocation of a specific sentence to a list and sub-list should be as random as possible

An example of the dataframe:

SetNum       Sentence         Type   Index
1          I went to work      H       0
1          She went to work    L       1
2          I drink coffee      H       2
2          She drinks coffee   L       3
3          The desk is red     H       4
3          The desk is white   L       5
4          The TV is big       H       6
4          The TV is white     L       7
5          This is a car       H       8
5          This is a plane     L       9

..
96         Good morning        H       194
96         Good night          L       195

How can it be done?

Thanks!

Try this:

import random
from itertools import chain
import numpy as np
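# Group by type and collect the sentences: H_ holds the 96 'H' sentences, L_ the 96 'L' ones.
# This relies on the rows being ordered by SetNum, so H_[i] and L_[i] belong to the same set.
H_, L_ = df.groupby('Type')['Sentence'].agg(list)

# dict mapping list number -> its sub-lists (a plain list would work too; a dict prints more readably)
dict_of_list = {}

# loop 10 times, creating one list per iteration
for j in range(10):
    # pick 48 random sets that contribute their H sentence; the other 48 contribute their L sentence
    random_idx = random.sample(range(96), 48)
    H_idx = np.isin(range(96), random_idx)
    L_idx = ~H_idx
    H, L = np.array(H_)[H_idx].tolist(), np.array(L_)[L_idx].tolist()
    # shuffle each half
    H, L = random.sample(H, len(H)), random.sample(L, len(L))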

    # then zip together and chain them, to make a sequence of [H, L, H, L, ...]
    pair_wise_list = list(chain(*zip(H,L)))

    # zip(*[iter(pair_wise_list)]*16) splits the whole list into sub-lists of 16 (8 H and 8 L each)
    # more about zip(*[iter(pair_wise_list)]*16) in the reference below
    # random.sample(list(i), len(i)) then shuffles the positions of H and L inside each sub-list
    lst = [random.sample(list(i),len(i)) for i in zip(*[iter(pair_wise_list)]*16)]

    dict_of_list[j] = lst
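
Note that the loop above draws the H/L split independently for every list, so nothing forces each set to contribute exactly 5 H and 5 L sentences across the 10 lists (condition 4). One possible way to enforce it is to build the lists in complementary pairs: draw a random 48/48 split, use it for one list, and use the inverted split for its partner. The sketch below is only an illustration under several assumptions: the function name `allocate_balanced` and its parameters are made up here, every set is assumed to have exactly one H row and one L row, and the number of lists is assumed to be even. It returns row index labels rather than sentences, so the metadata stays reachable through `df.loc`. Pairing the lists makes the result less random than a uniform draw over all valid allocations.

```python
import numpy as np
import pandas as pd


def allocate_balanced(df: pd.DataFrame, n_lists=10, sublist_size=16, seed=None):
    """Hypothetical helper: returns {list_id: [sub-lists of df index labels]}."""
    rng = np.random.default_rng(seed)
    # Row labels per type, aligned by SetNum (assumes one H and one L row per set).
    h_rows = df[df["Type"] == "H"].sort_values("SetNum").index.to_numpy()
    l_rows = df[df["Type"] == "L"].sort_values("SetNum").index.to_numpy()
    n_sets = len(h_rows)
    half = n_sets // 2            # sentences of each type per list
    per_type = sublist_size // 2  # sentences of each type per sub-list
    if n_lists % 2 or n_sets % 2 or sublist_size % 2 or half % per_type:
        raise ValueError("sizes do not divide evenly")

    # Build masks in complementary pairs: True means "this set gives its H row".
    masks = []
    for _ in range(n_lists // 2):
        mask = np.zeros(n_sets, dtype=bool)
        mask[rng.choice(n_sets, size=half, replace=False)] = True
        masks += [mask, ~mask]
    order = rng.permutation(n_lists)  # so paired lists are not always adjacent

    result = {}
    for list_id, k in enumerate(order):
        mask = masks[k]
        h = rng.permutation(h_rows[mask])   # shuffled H rows for this list
        l = rng.permutation(l_rows[~mask])  # shuffled L rows for this list
        sublists = []
        for start in range(0, half, per_type):
            chunk = np.concatenate([h[start:start + per_type],
                                    l[start:start + per_type]])
            sublists.append(rng.permutation(chunk).tolist())
        result[list_id] = sublists
    return result
```

With the 192-row dataframe from the question, `allocate_balanced(df)[0][0]` would be the 16 row labels of the first sub-list of the first list, and `df.loc[allocate_balanced(df)[0][0]]` the corresponding rows.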

I copied 10 rows and made 10 lists, each containing 5 sub-lists of 2 sentences, one of each type. In each of the 10 lists, every sentence appears exactly once, and the types are balanced.

>>> df
   SetNum           Sentence Type  Index
0       1     I went to work    H      0
1       1   She went to work    L      1
2       2     I drink coffee    H      2
3       2  She drinks coffee    L      3
4       3    The desk is red    H      4
5       3  The desk is white    L      5
6       4      The TV is big    H      6
7       4    The TV is white    L      7
8       5      This is a car    H      8
9       5    This is a plane    L      9

>>> import random
>>> from itertools import chain
>>> H, L = df.groupby('Type')['Sentence'].agg(list)

>>> dict_of_list = {}
>>> for j in range(10):
...     H, L = random.sample(H, len(H)), random.sample(L, len(L))
...     pair_wise_list = list(chain(*zip(H,L)))
...     lst = [random.sample(list(i),len(i)) for i in zip(*[iter(pair_wise_list)]*2)] # had to change to 2
...     dict_of_list[j] = lst
>>> dict_of_list

{0: [['The desk is white', 'I drink coffee'],
  ['This is a car', 'The TV is white'],
  ['She went to work', 'The TV is big'],
  ['I went to work', 'She drinks coffee'],
  ['This is a plane', 'The desk is red']],
 1: [['She went to work', 'The TV is big'],
  ['I went to work', 'The desk is white'],
  ['This is a car', 'This is a plane'],
  ['The TV is white', 'The desk is red'],
  ['I drink coffee', 'She drinks coffee']],
 2: [['The desk is red', 'The TV is white'],
  ['The TV is big', 'She drinks coffee'],
  ['I went to work', 'This is a plane'],
  ['She went to work', 'This is a car'],
  ['The desk is white', 'I drink coffee']],
 3: [['The desk is red', 'The TV is white'],
  ['I drink coffee', 'She drinks coffee'],
  ['She went to work', 'I went to work'],
  ['This is a car', 'This is a plane'],
  ['The desk is white', 'The TV is big']],
 4: [['I went to work', 'This is a plane'],
  ['The desk is red', 'She drinks coffee'],
  ['The TV is white', 'This is a car'],
  ['The TV is big', 'The desk is white'],
  ['She went to work', 'I drink coffee']],
 5: [['She drinks coffee', 'This is a car'],
  ['She went to work', 'I went to work'],
  ['The desk is white', 'The TV is big'],
  ['I drink coffee', 'The TV is white'],
  ['The desk is red', 'This is a plane']],
 6: [['This is a plane', 'The TV is big'],
  ['She drinks coffee', 'I went to work'],
  ['She went to work', 'This is a car'],
  ['I drink coffee', 'The TV is white'],
  ['The desk is white', 'The desk is red']],
 7: [['The desk is red', 'She drinks coffee'],
  ['This is a car', 'The TV is white'],
  ['The TV is big', 'She went to work'],
  ['I went to work', 'The desk is white'],
  ['This is a plane', 'I drink coffee']],
 8: [['I went to work', 'She went to work'],
  ['I drink coffee', 'The desk is white'],
  ['The TV is big', 'The TV is white'],
  ['The desk is red', 'She drinks coffee'],
  ['This is a plane', 'This is a car']],
 9: [['She went to work', 'The TV is big'],
  ['I went to work', 'The TV is white'],
  ['I drink coffee', 'The desk is white'],
  ['This is a plane', 'This is a car'],
  ['She drinks coffee', 'The desk is red']]}
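
A result like the one above can be sanity-checked programmatically instead of by eye. The following is a minimal sketch, assuming sentences are unique so that a plain dict can map each sentence to its type; the helper name `sublists_balanced` and the tiny sample data are invented for illustration.

```python
from collections import Counter


def sublists_balanced(dict_of_list, type_of):
    """Return True if every sub-list holds as many 'H' as 'L' sentences."""
    for sublists in dict_of_list.values():
        for sub in sublists:
            counts = Counter(type_of[s] for s in sub)
            if counts["H"] != counts["L"]:
                return False
    return True


# Small hand-made sample; with a real dataframe the mapping could be built
# as dict(zip(df["Sentence"], df["Type"])).
type_of = {
    "I went to work": "H",
    "She went to work": "L",
    "I drink coffee": "H",
    "She drinks coffee": "L",
}
sample = {0: [["She went to work", "I drink coffee"],
              ["I went to work", "She drinks coffee"]]}
print(sublists_balanced(sample, type_of))  # True
```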

EDIT: To keep the full contents of each row (not just the sentence), change the first few lines to this:

import random
from itertools import chain
# join all columns of each row into a single string, on a copy of df
df2 = df.copy()
df2['joined'] = df.astype(str).agg(', '.join, axis=1)
# group by type 'H' and 'L', collect the joined rows as lists, and assign them to the two variables
H, L = df2.groupby('Type')['joined'].agg(list)

Reference:

How does zip(*[iter(s)]*n) work in Python?
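
As a brief illustration of the idiom named in that reference: the list `[it] * n` holds the same iterator object `n` times, so each tuple that `zip` builds consumes `n` consecutive items. The wrapper name `chunks` below is just a label chosen for this sketch.

```python
def chunks(seq, n):
    """Split seq into tuples of n consecutive items using zip(*[iter(seq)]*n)."""
    it = iter(seq)
    # The same iterator appears n times, so zip pulls n items in a row per tuple.
    return list(zip(*[it] * n))


print(chunks("abcdefgh", 3))
# [('a', 'b', 'c'), ('d', 'e', 'f')]  (the incomplete tail 'g', 'h' is dropped)
```

Since `zip` stops at the first exhausted input, an incomplete final chunk is silently discarded. In the answer's code this does not lose anything, because 96 is a multiple of 16.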
