
Split data into training and testing

I want to replicate this tutorial to classify two groups, https://machinelearningmastery.com/develop-n-gram-multichannel-convolutional-neural-network-sentiment-analysis/, with a different dataset, but I have not been able to despite trying hard. I am new to programming, so I would appreciate any assistance or tips.

My dataset is small (240 files for each group), and the files are named 01 - 0240.

I think it is around these lines of code:

    if is_train and filename.startswith('cv9'):
        continue
    if not is_train and not filename.startswith('cv9'):
        continue

and also these:

trainy = [0 for _ in range(900)] + [1 for _ in range(900)]
save_dataset([trainX,trainy], 'train.pkl')

testY = [0 for _ in range(100)] + [1 for _ in range(100)]
save_dataset([testX,testY], 'test.pkl')
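
Since each group here has 240 files rather than the 900/100 split used on the tutorial's review dataset, both snippets need adapting: the filename check has to select this dataset's test files, and the label counts have to follow the actual list lengths instead of the hard-coded 900 and 100 (as the code further below already does with len()). A minimal sketch of one way to drive the split from the file number is shown here; the names is_test_file and TEST_THRESHOLD are illustrative, and it assumes zero-padded numeric file names such as '0001.txt' to '0240.txt'.

TEST_THRESHOLD = 216  # send roughly the last 10% of each group to the test set

def is_test_file(filename):
    # read the leading number from names like '0231.txt'
    number = int(filename.split('.')[0])
    return number > TEST_THRESHOLD

# inside process_docs(), the 'cv9' checks would then become:
#     if is_train and is_test_file(filename):
#         continue
#     if not is_train and not is_test_file(filename):
#         continue

print(is_test_file('0042.txt'))  # False -> stays in the training split
print(is_test_file('0231.txt'))  # True  -> goes to the test split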

Two errors have been encountered so far:

Input arrays should have the same number of samples as target arrays. Found 483 input samples and 200 target samples.

Unable to open file (unable to open file: name = 'model.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
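
The second error typically means that no model.h5 exists at the path being loaded, i.e. a trained model was never saved (or was saved somewhere else) before the loading step. A minimal sketch of the save/load pair in Keras is shown below; the tiny Sequential model is only a stand-in, not the tutorial's multichannel CNN.

from keras.models import Sequential, load_model
from keras.layers import Dense

# stand-in model, only to show the save/load pair
model = Sequential([Dense(1, activation='sigmoid', input_shape=(10,))])
model.compile(loss='binary_crossentropy', optimizer='adam')

model.save('model.h5')          # creates model.h5 in the working directory
model = load_model('model.h5')  # this call fails with errno 2 if the file is missing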

I would really appreciate any prompt help.

Thanks in advance.

// Part of the code, for more clarity //

from os import listdir
from pickle import dump

# load all docs in a directory
def process_docs(directory, is_train):
    documents = list()
    # walk through all files in the folder
    for filename in listdir(directory):
        # skip any transcript in the test set

I want to add an argument below to indicate whether to process the training or testing files, just as mentioned in the tutorial. Or, if there is another way, please share it (one alternative is sketched after this function).

        if is_train and filename.startswith('----'):
            continue
        if not is_train and not filename.startswith('----'):
            continue
        # create the full path of the file to open
        path = directory + '/' + filename
        # load the doc
        doc = load_doc(path)
        # clean doc
        tokens = clean_doc(doc)
        # add to list
        documents.append(tokens)
    return documents
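
One alternative to the boolean flag, as a hedged sketch rather than the tutorial's approach, is to pass a small predicate that decides whether a file belongs to the split being built. The name process_docs_by, the keep_file parameter, and the numeric threshold are illustrative; load_doc() and clean_doc() are the helpers already used above.

from os import listdir

def process_docs_by(directory, keep_file):
    documents = list()
    for filename in listdir(directory):
        # keep_file decides whether this file belongs to the split being built
        if not keep_file(filename):
            continue
        path = directory + '/' + filename
        doc = load_doc(path)        # helper used above
        tokens = clean_doc(doc)     # helper used above
        documents.append(tokens)
    return documents

# illustrative split: files numbered up to 0216 for training, the rest for testing
train_healthy = process_docs_by('PathToData/healthy', lambda f: int(f.split('.')[0]) <= 216)
test_healthy = process_docs_by('PathToData/healthy', lambda f: int(f.split('.')[0]) > 216)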

# save a dataset to file
def save_dataset(dataset, filename):
    dump(dataset, open(filename, 'wb'))
    print('Saved: %s' % filename)

# load all training transcripts
healthy_docs = process_docs('PathToData/healthy', True)
sick_docs = process_docs('PathToData/sick', True)
trainX = healthy_docs + sick_docs
trainy = [0 for _ in range(len( healthy_docs ))] + [1 for _ in range(len( sick_docs ))]
save_dataset([trainX,trainy], 'train.pkl')

# load all test transcripts
healthy_docs = process_docs('PathToData/healthy', False)
sick_docs = process_docs('PathToData/sick', False)
testX = healthy_docs + sick_docs
testY = [0 for _ in range(len( healthy_docs ))] + [1 for _ in range(len( sick_docs ))]

save_dataset([testX,testY], 'test.pkl')
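
Given the first error above (483 input samples vs. 200 target samples), a quick sanity check before saving can confirm that each document list and its label list have the same length; this is an illustrative addition, not part of the tutorial.

# sanity check: these pairs must match in length, otherwise Keras
# raises the sample-count mismatch quoted above
assert len(trainX) == len(trainy), (len(trainX), len(trainy))
assert len(testX) == len(testY), (len(testX), len(testY))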

You should post more of your code, but it sounds like your problem is curating the data. Say you have 240 files in a folder called 'healthy' and 240 files in a folder called 'sick'. Then you need to label all the healthy people with label 0 and all the sick people with label 1. Try something like:

from glob import glob 
from sklearn.model_selection import train_test_split

#get the filenames for healthy people 
xhealthy = [ fname for fname in glob( 'pathToData/healthy/*' )]

#give healthy people label of 0
yhealthy = [ 0 for i in range( len( xhealthy ))]

#get the filenames of sick people
xsick    = [ fname for fname in glob( 'pathToData/sick/*')]

#give sick people label of 1
ysick    = [ 1 for i in range( len( xsick ))]

#combine the data 
xdata = xhealthy + xsick 
ydata = yhealthy + ysick 

#create the training and test set 
X_train, X_test, y_train, y_test = train_test_split(xdata, ydata, test_size=0.1)

Then train your models with X_train, y_train and test them with X_test, y_test - keeping in mind that your X data are just file names that still need processing. The more code you post, the more people can help with your question.
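
As a hedged sketch of that remaining processing step, assuming the tutorial's load_doc() and clean_doc() helpers and the save_dataset() function above are in scope, the file names returned by train_test_split can be loaded, cleaned and saved like this:

# turn the file-name splits into cleaned token lists, then save them as before
train_docs = [clean_doc(load_doc(fname)) for fname in X_train]
test_docs = [clean_doc(load_doc(fname)) for fname in X_test]

save_dataset([train_docs, y_train], 'train.pkl')
save_dataset([test_docs, y_test], 'test.pkl')

With only 240 files per class, passing stratify=ydata to train_test_split also keeps the two classes balanced in the 10% test split.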

I was able to solve the problem by separating the dataset into train and test sets manually and then labelling each set on its own. My current dataset is very small, so I will keep looking for a better solution for large datasets once I have the capacity. Posting this to close the question.
