
Train/test split a list of sentences

I have a list of sentences that I want to randomly split into 80% and 20%. The list looks like this:

['Hi.',
 'Hi.',
 'Run!',
 'Wow!',
 'Wow!',
 'Fire!',
 'Help!',
 'Help!',
 'Stop!',
 'Wait!',
 'Go on.',
 'Hello!',
 'I ran.',
 'I see.',
 'I see.',
 'I try.',
 'I won!',...]

I was thinking of using a mask:

import random
mask = [0] * 4000 + [1] * 16000
random.shuffle(mask)

But the list is not like a data frame. I also tried:

percent=80
bol_mask =[random.randrange(100) < percent for i in range(100)]

But I can't really apply a boolean mask to a list of sentences.

Also, the split mask must be kept, because it will later be applied to another collection in German, which holds the corresponding translations.

It looks like this:

array([[ 553,    6,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0],
       [3430, 1114,    6,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0],
       [1115,    6,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0],
       [3431,    6,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0],
       [3432,    6,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0],
       [2459,    6,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0],
       [3433,    6,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0],
       [1533, 3434,    6,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0],
       [2460,    6,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0],
       [ 394,    6,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0]],
      dtype=int32)

My question is: how do I apply a mask to a list of sentences, and then keep the same split and apply it to the corresponding ndarray?

If using scikit-learn is an option, you can just use the train_test_split function as follows:

>>> from sklearn.model_selection import train_test_split
>>> x
['Hi.', 'Hi.', 'Run!', 'Wow!', 'Wow!', 'Fire!', 'Help!', 'Help!', 'Stop!', 'Wait!']

>>> len(x)
10
>>> x1
array([[ 553,    6,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0],
       [3430, 1114,    6,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0],
       [1115,    6,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0],
       [3431,    6,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0],
       [3432,    6,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0],
       [2459,    6,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0],
       [3433,    6,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0],
       [1533, 3434,    6,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0],
       [2460,    6,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0],
       [ 394,    6,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0]])
>>> x1.shape
(10, 20)

# Assuming x and x1 have the same length, train_test_split should work fine.
>>> train, test, train_german, test_german = train_test_split(x, x1, test_size=0.2, shuffle=True)
>>> len(train)
8
>>> len(test)
2
>>> len(train_german)
8
>>> len(test_german)
2
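
Since the question also asks for the split to be kept and reused later, one possible variation is to split an array of row indices instead of the data itself, and then index both collections with the result. This is only a sketch: the stand-in array `padded_demo`, the name `sentences`, and `random_state=0` are assumptions made for illustration and are not from the original post.

```python
import numpy as np
from sklearn.model_selection import train_test_split

sentences = ['Hi.', 'Hi.', 'Run!', 'Wow!', 'Wow!', 'Fire!', 'Help!', 'Help!', 'Stop!', 'Wait!']
# Stand-in for the padded German array (illustrative values only).
padded_demo = np.arange(len(sentences) * 20).reshape(len(sentences), 20)

# Split the row indices once; random_state makes the split reproducible.
indices = np.arange(len(sentences))
train_idx, test_idx = train_test_split(indices, test_size=0.2, shuffle=True, random_state=0)

# The same index arrays can be reused on any collection with matching row order.
train = [sentences[i] for i in train_idx]
test = [sentences[i] for i in test_idx]
train_german = padded_demo[train_idx]
test_german = padded_demo[test_idx]
```

Keeping `train_idx` and `test_idx` around (or saving them to disk) lets the identical split be applied to other aligned data later.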

Actually, I've solved it myself.

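import random
import numpy as np

# Each position is True with probability 0.8, so the split is only approximately 80/20.
# 20000 is assumed to equal len(Eng).
bol_mask = [random.randrange(100) < 80 for i in range(20000)]
inv_mask = np.invert(bol_mask)

# Apply the same mask (and its inverse) to both collections so the pairs stay aligned.
Eng_train = np.array(Eng)[bol_mask]
Eng_test = np.array(Eng)[inv_mask]
German_train = padded[bol_mask]
German_test = padded[inv_mask]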
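
A random per-item mask like the one above yields roughly, not exactly, 80% training items. If an exact split is wanted, a shuffled mask with a fixed number of True values (as in the first attempt in the question) might be used instead, and `itertools.compress` can apply it to a plain Python list without converting to NumPy. The following is a minimal sketch; the helper names `make_split_mask` and `apply_mask`, the seed, and the stand-in `german` rows are hypothetical and chosen only for illustration.

```python
import random
from itertools import compress

def make_split_mask(n, train_fraction=0.8, seed=None):
    """Return a shuffled list of n booleans with exactly round(n * train_fraction) True values."""
    n_train = round(n * train_fraction)
    mask = [True] * n_train + [False] * (n - n_train)
    random.Random(seed).shuffle(mask)
    return mask

def apply_mask(items, mask):
    """Split items into (selected, rest) according to a boolean mask of the same length."""
    if len(items) != len(mask):
        raise ValueError("items and mask must have the same length")
    selected = list(compress(items, mask))
    rest = list(compress(items, (not m for m in mask)))
    return selected, rest

eng = ['Hi.', 'Hi.', 'Run!', 'Wow!', 'Wow!', 'Fire!', 'Help!', 'Help!', 'Stop!', 'Wait!']
# Stand-in for the padded German rows (illustrative values only).
german = [[i, 6, 0] for i in range(len(eng))]

# Build the mask once, then reuse it for every aligned collection.
mask = make_split_mask(len(eng), train_fraction=0.8, seed=42)
eng_train, eng_test = apply_mask(eng, mask)
german_train, german_test = apply_mask(german, mask)
```

For an ndarray such as `padded`, the same mask can be used through boolean indexing, for example `padded[np.array(mask)]`.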

Thanks, Grayrigel; I've accepted your answer for your effort in helping.

