
Python: How to sample data into Test and Train datasets?

I have been using CSV data in my scripts and want to sample the data into two datasets:

  1. Test Data
  2. Train Data

I want to split the dataset into 85% and 15% partitions and output two CSV files, Test.csv and Train.csv.

I want to do this in base Python, without any external modules such as NumPy, SciPy, Pandas, or scikit-learn. Can anyone help me randomly sample data by percentage? Note that the datasets I will be given may have any number of observations. So far I have only read about Pandas and various other modules for sampling data on a percentage basis, and I have not found a concrete solution to my problem.

Moreover, I want to retain the CSV header in both files, because the header makes each row accessible and usable in further analysis.

Use random.shuffle to create a random permutation of your dataset and slice it as you wish:

import random
random.shuffle(data)                    # in-place random permutation
train = data[:int(len(data) * 0.85)]    # first 85% for training
test = data[len(train):]                # remaining 15% for testing
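For a CSV input, the data list above can be built with the standard-library csv module, keeping the header aside so it can be written to both output files. A minimal, self-contained sketch — the in-line sample file and the names data.csv, Train.csv, and Test.csv are just illustrative:

```python
import csv
import random

# Create a tiny sample file just so the sketch runs end to end
with open('data.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(['id', 'value'])                    # header row
    w.writerows([[i, i * 10] for i in range(20)])  # 20 observations

# Read the header and the data rows separately
with open('data.csv', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)
    data = list(reader)

# Shuffle and slice: first 85% train, remaining 15% test
random.shuffle(data)
train = data[:int(len(data) * 0.85)]
test = data[len(train):]

# Write both files, repeating the header in each
for name, rows in (('Train.csv', train), ('Test.csv', test)):
    with open(name, 'w', newline='') as f:
        w = csv.writer(f)
        w.writerow(header)
        w.writerows(rows)
```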

Since you requested a specific solution that partitions a potentially large CSV file into two files for training and test data, I'll also show how that can be done using an approach similar to the general method described above:

import random

# Count the non-empty data lines, excluding the header
with open('data.csv', 'r') as csvf:
    next(csvf)  # skip the header
    linecount = sum(1 for line in csvf if line.strip() != '')

# Create the index set for the test data (15% of the rows)
indices = list(range(linecount))
random.shuffle(indices)
ind_test = set(indices[:int(linecount * 0.15)])
del indices

# Partition the CSV file, writing the header to both output files
with open('data.csv', 'r') as csvf, open('train.csv', 'w') as trainf, open('test.csv', 'w') as testf:
    header = next(csvf)
    trainf.write(header)
    testf.write(header)
    i = 0
    for line in csvf:
        if line.strip() != '':
            if i in ind_test:
                testf.write(line.strip() + '\n')
            else:
                trainf.write(line.strip() + '\n')
            i += 1

Here, I assume that the CSV file contains one observation per row.

This creates an exact 85:15 split. If a less exact partition is acceptable, the solution of Peter Wood would be much more efficient.
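As a small variation on the index-set step above, the test indices can be drawn directly with random.sample, which avoids building and shuffling the full index list. A sketch — linecount stands for the number of data rows, and the value here is just a placeholder:

```python
import random

linecount = 1000  # e.g. the number of data rows counted in the first pass

# Draw 15% of the row indices, without replacement, as the test set
ind_test = set(random.sample(range(linecount), int(linecount * 0.15)))
```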

Use the random function in the random module to get a uniformly distributed random number between 0 and 1.

If it's greater than 0.85, write the line to the test data; otherwise, to the training data. That gives roughly 85% training and 15% test rows. See How do I simulate flip of biased coin in python?

import random

with open(input_file) as data, \
     open(test_output, 'w') as test, \
     open(train_output, 'w') as train:
    header = next(data)
    test.write(header)
    train.write(header)
    for line in data:
        if random.random() > 0.85:
            test.write(line)
        else:
            train.write(line)
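The coin-flip approach only approximates the 85:15 ratio; the proportions converge as the number of rows grows. A quick sanity check of the biased coin itself, counting draws of random.random() at or below 0.85 as training rows (the fixed seed is just for reproducibility):

```python
import random

random.seed(0)  # fixed seed so the run is reproducible
n = 100_000
train_count = sum(1 for _ in range(n) if random.random() <= 0.85)
share = train_count / n
print(share)  # close to 0.85, but rarely exact
```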
