
Balanced allocation of dataframe rows to lists based on row type

I have a dataframe with 192 rows. Each row represents a sentence with some metadata and a specific type (H or L), so there are 96 sentences of each type. The sentences come in 96 sets, each with one H sentence and one L sentence. I need to allocate them to 10 different lists under the following conditions:

  1. Each list has 6 sub-lists of 16 sentences each.
  2. Each sub-list has an equal number of sentences of each type (8 H and 8 L).
  3. Each list has exactly one sentence from each of the 96 sets (so 96 sentences per list in total, 48 H and 48 L).
  4. Across the 10 lists, each set contributes its H sentence 5 times and its L sentence 5 times, so the allocation of types is balanced for every set.

    For example, for set #1 the sentence "I went to work" will appear in 5 lists (anywhere in the list; the sub-list doesn't matter here), and the sentence "She went to work" will appear in the other 5 lists.

  5. The allocation of specific sentences to lists and sub-lists should be as random as possible.

An example of the dataframe:

SetNum       Sentence         Type   Index
1          I went to work      H       0
1          She went to work    L       1
2          I drink coffee      H       2
2          She drinks coffee   L       3
3          The desk is red     H       4
3          The desk is white   L       5
4          The TV is big       H       6
4          The TV is white     L       7
5          This is a car       H       8
5          This is a plane     L       9

..
96         Good morning        H       190
96         Good night          L       191

How can it be done?

Thanks!

Try this:

import random
from itertools import chain
import numpy as np
# group the sentences by type; groups are sorted by key, so 'H' comes before 'L'
H_, L_ = df.groupby('Type')['Sentence'].agg(list)

# holds the 10 lists (a dict rather than a list, so the output is easier to read)
dict_of_list = {}

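# build each of the 10 lists
for j in range(10):
    # pick 48 of the 96 sets at random to supply their H sentence
    # (assumes the rows are ordered by SetNum, so position i in H_ and L_ is the same set)
    random_idx = random.sample(range(96), 48)
    H_idx = np.isin(range(96), random_idx)
    # the remaining 48 sets supply their L sentence
    L_idx = ~H_idx
    H, L = np.array(H_)[H_idx].tolist(), np.array(L_)[L_idx].tolist()
    # shuffle both halves
    H, L = random.sample(H, len(H)), random.sample(L,len(L))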

    # interleave the two halves into a sequence of [H, L, H, L, ...]
    pair_wise_list = list(chain(*zip(H,L)))

    # zip(*[iter(pair_wise_list)]*16) splits the list into consecutive chunks of 16,
    # each holding 8 H and 8 L (see the reference below for how the idiom works);
    # random.sample(list(i),len(i)) then shuffles the positions of H and L within each sub-list
    lst = [random.sample(list(i),len(i)) for i in zip(*[iter(pair_wise_list)]*16)]

    dict_of_list[j] = lst

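To try it out, I copied the 10 sample rows and made 10 lists, each containing 5 sub-lists of 2 sentences, one of each type. Within each list, every sentence appears exactly once and the types are balanced. (This small demo skips the set-selection step, so both sentences of a set end up in the same list.)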

>>> df
   SetNum           Sentence Type  Index
0       1     I went to work    H      0
1       1   She went to work    L      1
2       2     I drink coffee    H      2
3       2  She drinks coffee    L      3
4       3    The desk is red    H      4
5       3  The desk is white    L      5
6       4      The TV is big    H      6
7       4    The TV is white    L      7
8       5      This is a car    H      8
9       5    This is a plane    L      9

>>> import random
>>> from itertools import chain
>>> H, L = df.groupby('Type')['Sentence'].agg(list)

>>> dict_of_list = {}
>>> for j in range(10):
...     H, L = random.sample(H, len(H)), random.sample(L, len(L))
...     pair_wise_list = list(chain(*zip(H,L)))
...     lst = [random.sample(list(i),len(i)) for i in zip(*[iter(pair_wise_list)]*2)] # chunk size changed to 2 for the small sample
...     dict_of_list[j] = lst
>>> dict_of_list

{0: [['The desk is white', 'I drink coffee'],
  ['This is a car', 'The TV is white'],
  ['She went to work', 'The TV is big'],
  ['I went to work', 'She drinks coffee'],
  ['This is a plane', 'The desk is red']],
 1: [['She went to work', 'The TV is big'],
  ['I went to work', 'The desk is white'],
  ['This is a car', 'This is a plane'],
  ['The TV is white', 'The desk is red'],
  ['I drink coffee', 'She drinks coffee']],
 2: [['The desk is red', 'The TV is white'],
  ['The TV is big', 'She drinks coffee'],
  ['I went to work', 'This is a plane'],
  ['She went to work', 'This is a car'],
  ['The desk is white', 'I drink coffee']],
 3: [['The desk is red', 'The TV is white'],
  ['I drink coffee', 'She drinks coffee'],
  ['She went to work', 'I went to work'],
  ['This is a car', 'This is a plane'],
  ['The desk is white', 'The TV is big']],
 4: [['I went to work', 'This is a plane'],
  ['The desk is red', 'She drinks coffee'],
  ['The TV is white', 'This is a car'],
  ['The TV is big', 'The desk is white'],
  ['She went to work', 'I drink coffee']],
 5: [['She drinks coffee', 'This is a car'],
  ['She went to work', 'I went to work'],
  ['The desk is white', 'The TV is big'],
  ['I drink coffee', 'The TV is white'],
  ['The desk is red', 'This is a plane']],
 6: [['This is a plane', 'The TV is big'],
  ['She drinks coffee', 'I went to work'],
  ['She went to work', 'This is a car'],
  ['I drink coffee', 'The TV is white'],
  ['The desk is white', 'The desk is red']],
 7: [['The desk is red', 'She drinks coffee'],
  ['This is a car', 'The TV is white'],
  ['The TV is big', 'She went to work'],
  ['I went to work', 'The desk is white'],
  ['This is a plane', 'I drink coffee']],
 8: [['I went to work', 'She went to work'],
  ['I drink coffee', 'The desk is white'],
  ['The TV is big', 'The TV is white'],
  ['The desk is red', 'She drinks coffee'],
  ['This is a plane', 'This is a car']],
 9: [['She went to work', 'The TV is big'],
  ['I went to work', 'The TV is white'],
  ['I drink coffee', 'The desk is white'],
  ['This is a plane', 'This is a car'],
  ['She drinks coffee', 'The desk is red']]}
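
One caveat: drawing a fresh random split for every list keeps each list at 48 H and 48 L, but it does not by itself guarantee condition 4 (5 H and 5 L per set across the 10 lists). A possible way around that, as a rough sketch rather than a definitive solution, is to draw only 5 random splits and use each split together with its complement, so every set is H in one list of the pair and L in the other. The function name `balanced_lists`, its parameters, and the placeholder data are my own assumptions for illustration; the sketch also assumes an even number of lists and that each set has exactly one H and one L row.

```python
import random
from collections import Counter

import pandas as pd

# Hypothetical stand-in for the real data: 96 sets, one H and one L sentence each.
df = pd.DataFrame({
    "SetNum": [s for s in range(1, 97) for _ in range(2)],
    "Sentence": [f"set{s}-{t}" for s in range(1, 97) for t in "HL"],
    "Type": ["H", "L"] * 96,
})


def balanced_lists(df, n_lists=10, sub_size=16, seed=None):
    """Sketch: every set ends up as H in half of the lists and as L in the other half."""
    rng = random.Random(seed)
    # One row per set, with the H and L sentence side by side.
    wide = df.pivot(index="SetNum", columns="Type", values="Sentence")
    sets = list(wide.index)
    half = len(sets) // 2
    per_type = sub_size // 2

    # Each random split yields two complementary assignments:
    # sets that are H in one list are L in its partner list.
    assignments = []
    for _ in range(n_lists // 2):
        h_sets = set(rng.sample(sets, half))
        assignments.append(h_sets)
        assignments.append(set(sets) - h_sets)
    rng.shuffle(assignments)  # so partner lists are not adjacent

    result = {}
    for j, h_sets in enumerate(assignments):
        H = [wide.at[s, "H"] for s in sets if s in h_sets]
        L = [wide.at[s, "L"] for s in sets if s not in h_sets]
        rng.shuffle(H)
        rng.shuffle(L)
        sublists = []
        for k in range(0, half, per_type):
            # 8 H and 8 L per sub-list, then shuffle the positions.
            sub = H[k:k + per_type] + L[k:k + per_type]
            rng.shuffle(sub)
            sublists.append(sub)
        result[j] = sublists
    return result


lists = balanced_lists(df, seed=0)
# Count how often each sentence is used across all lists (5 times each).
usage = Counter(s for subs in lists.values() for sub in subs for s in sub)
```

The trade-off is that the lists come in complementary pairs, so this is less random than sampling uniformly from all valid allocations, but every condition on the counts holds by construction.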

EDIT: To keep all the columns of each row rather than just the sentence, change the first few lines to this:

import random
from itertools import chain
# join all columns of each row into one string, then group those strings by type
df2 = df.copy()
df2['joined'] = df.astype(str).agg(', '.join, axis=1)
H_, L_ = df2.groupby('Type')['joined'].agg(list)
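
If you would rather keep the metadata as real columns instead of one joined string, another option is to allocate row labels and look the rows up afterwards. This is only a sketch: it assumes the dataframe's index labels are unique, and the four-row frame and the `sublist` variable are made up for illustration.

```python
import pandas as pd

# Small hypothetical sample with the same columns as in the question.
df = pd.DataFrame({
    "SetNum": [1, 1, 2, 2],
    "Sentence": ["I went to work", "She went to work",
                 "I drink coffee", "She drinks coffee"],
    "Type": ["H", "L", "H", "L"],
})

# Row labels per type, used in place of the sentence strings.
H_ = df.index[df["Type"] == "H"].tolist()
L_ = df.index[df["Type"] == "L"].tolist()

# Any sub-list of labels produced by the allocation maps back to full rows.
sublist = [H_[1], L_[0]]  # hypothetical sub-list, for illustration only
rows = df.loc[sublist]
```

`rows` is then a regular dataframe with all the original columns, in the order given by the sub-list.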

Reference:

How does zip(*[iter(s)]*n) work in Python?
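
In short, the idiom repeats one shared iterator n times, so zip pulls n consecutive items for every tuple it builds. A tiny illustration (the names `items`, `n` and `chunks` are just for this example):

```python
items = list(range(10))
n = 4

# [it] * n holds the SAME iterator n times, so each tuple zip builds
# consumes n consecutive items from it.
it = iter(items)
chunks = list(zip(*[it] * n))
# chunks == [(0, 1, 2, 3), (4, 5, 6, 7)]; the leftover items 8 and 9 are dropped
```

Because zip stops at the shortest input, any items that do not fill a complete chunk are silently dropped. That is harmless here, since 96 is a multiple of 16, but worth keeping in mind for other sizes.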
