Dataframe apply balanced allocation of rows to lists based on row type
I have a dataframe with 192 rows. Each row represents a sentence with some metadata and a specific type (H or L), so I have 96 sentences of each type. The rows come in sets that share a SetNum: one H sentence and one L sentence per set.
I need to allocate them to 10 different lists with the following condition:
Across the 10 lists, each set contributes its H sentence 5 times and its L sentence 5 times, so the two types are allocated equally often for every set.
So, taking set #1 as an example, the sentence "I went to work" will appear in 5 lists (anywhere in the list; the sublist does not matter at this point), and the sentence "She went to work" will appear in the other 5 lists.
An example of the dataframe:
SetNum Sentence Type Index
1 I went to work H 0
1 She went to work L 1
2 I drink coffee H 2
2 She drinks coffee L 3
3 The desk is red H 4
3 The desk is white L 5
4 The TV is big H 6
4 The TV is white L 7
5 This is a car H 8
5 This is a plane L 9
..
96 Good morning H 194
96 Good night L 195
How can this be done?
Thanks!
Try this:
import random
from itertools import chain
import numpy as np

# group by type 'H' and 'L', collect each group's sentences into a list, and unpack
H_, L_ = df.groupby('Type')['Sentence'].agg(list)
# initialize an empty dict (a list would also work; a dict keyed by list number is easier to inspect)
dict_of_list = {}
# loop 10 times, creating one list per iteration
for j in range(10):
    # pick 48 of the 96 positions to take from the H list;
    # the remaining 48 positions are taken from the L list
    random_idx = random.sample(range(96), 48)
    H_idx = np.isin(range(96), random_idx)
    L_idx = ~H_idx
    H, L = np.array(H_)[H_idx].tolist(), np.array(L_)[L_idx].tolist()
    # shuffle both selections
    H, L = random.sample(H, len(H)), random.sample(L, len(L))
    # then zip together and chain them, to make a sequence of [H, L, H, L, ...]
    pair_wise_list = list(chain(*zip(H, L)))
    # zip(*[iter(pair_wise_list)]*16) divides the entire list into sublists of 16
    # more about zip(*[iter(pair_wise_list)]*16) in the reference below
    # random.sample(list(i), len(i)) shuffles the positions of H and L within each sublist
    lst = [random.sample(list(i), len(i)) for i in zip(*[iter(pair_wise_list)]*16)]
    dict_of_list[j] = lst
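To make the chunking step concrete, here is a minimal, self-contained sketch of the interleave-then-chunk idiom on toy data. The names `h1`, `l1`, etc. are placeholders for illustration only, not values from the question.

```python
from itertools import chain

# Toy stand-ins for the selected H and L sentences (placeholder values).
H = ["h1", "h2", "h3", "h4"]
L = ["l1", "l2", "l3", "l4"]

# Interleave the two lists into [H, L, H, L, ...].
pair_wise_list = list(chain(*zip(H, L)))

# The list holds n references to the SAME iterator, so zip pulls n
# consecutive items for every tuple it builds. Leftover items that do
# not fill a whole chunk are silently dropped.
n = 4
chunks = [list(chunk) for chunk in zip(*[iter(pair_wise_list)] * n)]
print(chunks)
# [['h1', 'l1', 'h2', 'l2'], ['h3', 'l3', 'h4', 'l4']]
```

Because the input alternates H and L and the chunk size is even, every chunk holds the same number of each type. With the 96 selected sentences and a chunk size of 16, this gives 6 sublists of 8 H and 8 L each.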
Here is a demonstration on the 10 rows copied from the question. It produces 10 lists, each containing 5 sublists of 2 sentences, one of each type.
In each of the 10 lists, every sentence appears exactly once, and the two types are balanced.
>>> df
SetNum Sentence Type Index
0 1 I went to work H 0
1 1 She went to work L 1
2 2 I drink coffee H 2
3 2 She drinks coffee L 3
4 3 The desk is red H 4
5 3 The desk is white L 5
6 4 The TV is big H 6
7 4 The TV is white L 7
8 5 This is a car H 8
9 5 This is a plane L 9
>>> import random
>>> from itertools import chain
>>> H, L = df.groupby('Type')['Sentence'].agg(list)
>>> dict_of_list = {}
>>> for j in range(10):
...     H, L = random.sample(H, len(H)), random.sample(L, len(L))
...     pair_wise_list = list(chain(*zip(H,L)))
...     lst = [random.sample(list(i),len(i)) for i in zip(*[iter(pair_wise_list)]*2)] # had to change to 2
...     dict_of_list[j] = lst
...
>>> dict_of_list
{0: [['The desk is white', 'I drink coffee'],
['This is a car', 'The TV is white'],
['She went to work', 'The TV is big'],
['I went to work', 'She drinks coffee'],
['This is a plane', 'The desk is red']],
1: [['She went to work', 'The TV is big'],
['I went to work', 'The desk is white'],
['This is a car', 'This is a plane'],
['The TV is white', 'The desk is red'],
['I drink coffee', 'She drinks coffee']],
2: [['The desk is red', 'The TV is white'],
['The TV is big', 'She drinks coffee'],
['I went to work', 'This is a plane'],
['She went to work', 'This is a car'],
['The desk is white', 'I drink coffee']],
3: [['The desk is red', 'The TV is white'],
['I drink coffee', 'She drinks coffee'],
['She went to work', 'I went to work'],
['This is a car', 'This is a plane'],
['The desk is white', 'The TV is big']],
4: [['I went to work', 'This is a plane'],
['The desk is red', 'She drinks coffee'],
['The TV is white', 'This is a car'],
['The TV is big', 'The desk is white'],
['She went to work', 'I drink coffee']],
5: [['She drinks coffee', 'This is a car'],
['She went to work', 'I went to work'],
['The desk is white', 'The TV is big'],
['I drink coffee', 'The TV is white'],
['The desk is red', 'This is a plane']],
6: [['This is a plane', 'The TV is big'],
['She drinks coffee', 'I went to work'],
['She went to work', 'This is a car'],
['I drink coffee', 'The TV is white'],
['The desk is white', 'The desk is red']],
7: [['The desk is red', 'She drinks coffee'],
['This is a car', 'The TV is white'],
['The TV is big', 'She went to work'],
['I went to work', 'The desk is white'],
['This is a plane', 'I drink coffee']],
8: [['I went to work', 'She went to work'],
['I drink coffee', 'The desk is white'],
['The TV is big', 'The TV is white'],
['The desk is red', 'She drinks coffee'],
['This is a plane', 'This is a car']],
9: [['She went to work', 'The TV is big'],
['I went to work', 'The TV is white'],
['I drink coffee', 'The desk is white'],
['This is a plane', 'This is a car'],
['She drinks coffee', 'The desk is red']]}
EDIT: To get the contents of all columns rather than just the sentence, change the first few lines to this:
import random
from itertools import chain
# groups by type 'H' and 'L', makes list, and assigns to corresponding variables
df2 = df.copy()
df2['joined'] = df.astype(str).agg(', '.join,1)
H, L = df2.groupby('Type')['joined'].agg(list)
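One caveat: the loop above draws each of the 10 lists independently, so nothing forces a given set's H sentence to land in exactly 5 lists and its L sentence in the other 5. If that strict per-set balance is required, the following is a minimal sketch of one way to get it, not the method used above. It assumes the column names from the question (`SetNum`, `Sentence`, `Type`), exactly one H and one L row per set, and an even number of sets and lists. The helper name `allocate_balanced` and its `seed` parameter are hypothetical. Sets are processed in pairs with mirrored H/L patterns, which also keeps each list at an equal number of H and L sentences.

```python
import random

import pandas as pd


def allocate_balanced(df, n_lists=10, seed=None):
    """Hypothetical helper: every list gets one sentence per set; each set's
    H sentence lands in exactly half of the lists and its L sentence in the
    other half; each list holds equally many H and L sentences."""
    if n_lists % 2:
        raise ValueError("n_lists must be even for an exact H/L split")
    rng = random.Random(seed)
    # Map each set number to its two sentences, keyed by type.
    by_set = {
        set_num: dict(zip(group["Type"], group["Sentence"]))
        for set_num, group in df.groupby("SetNum")
    }
    set_nums = list(by_set)
    if len(set_nums) % 2:
        raise ValueError("an even number of sets is needed to balance each list")
    rng.shuffle(set_nums)

    lists = [[] for _ in range(n_lists)]
    # The second set of each pair receives the mirror image of the first
    # set's pattern, so every list gains one H and one L per pair.
    for a, b in zip(set_nums[::2], set_nums[1::2]):
        pattern = ["H"] * (n_lists // 2) + ["L"] * (n_lists // 2)
        rng.shuffle(pattern)
        for j, kind in enumerate(pattern):
            mirror = "L" if kind == "H" else "H"
            lists[j].append(by_set[a][kind])
            lists[j].append(by_set[b][mirror])
    # Shuffle positions so paired sets do not sit next to each other.
    for lst in lists:
        rng.shuffle(lst)
    return lists


# Small demo on the first four sets from the question.
demo = pd.DataFrame({
    "SetNum": [1, 1, 2, 2, 3, 3, 4, 4],
    "Sentence": [
        "I went to work", "She went to work",
        "I drink coffee", "She drinks coffee",
        "The desk is red", "The desk is white",
        "The TV is big", "The TV is white",
    ],
    "Type": ["H", "L"] * 4,
})
lists = allocate_balanced(demo, n_lists=10, seed=0)

# Each sentence should appear in exactly 5 of the 10 lists.
appearances = {s: sum(s in lst for lst in lists) for s in demo["Sentence"]}
print(appearances)
```

Each returned list is flat. If sublists are needed, the same `zip(*[iter(lst)]*n)` chunking shown earlier can be applied to it afterwards.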
Reference:
How does zip(*[iter(s)]*n) work in Python?