
Python: How to sample data into Test and Train datasets?

I have been using CSV data in my scripts and want to sample the data into two datasets:

  1. Test Data
  2. Train Data

I want to split the dataset into 85% and 15% partitions and output two CSV files, Test.csv and Train.csv.

I want to do this in base Python, without any external modules such as NumPy, SciPy, Pandas, or scikit-learn. Can anyone help me randomly sample data by percentage? Note that the datasets I will be given may have any number of observations. So far I have only read about Pandas and various other modules for sampling data on a percentage basis, and I have not found a concrete solution to my problem.

Moreover, I want to retain the CSV header in both files, because the header makes each row accessible and usable in further analysis.

Use random.shuffle to create a random permutation of your dataset and slice it as you wish:

import random
random.shuffle(data)                    # in-place random permutation
train = data[:int(len(data) * 0.85)]    # first 85% for training
test = data[len(train):]                # remaining 15% for testing
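For a CSV input, the data list above can be built with the standard-library csv module, keeping the header aside so it can be written to both output files. A minimal, self-contained sketch — the in-line sample file and the names data.csv, Train.csv, and Test.csv are just illustrative:

```python
import csv
import random

# Create a tiny sample file just so the sketch runs end to end
with open('data.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(['id', 'value'])                    # header row
    w.writerows([[i, i * 10] for i in range(20)])  # 20 observations

# Read the header and the data rows separately
with open('data.csv', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)
    data = list(reader)

# Shuffle and slice: first 85% train, remaining 15% test
random.shuffle(data)
train = data[:int(len(data) * 0.85)]
test = data[len(train):]

# Write both files, repeating the header in each
for name, rows in (('Train.csv', train), ('Test.csv', test)):
    with open(name, 'w', newline='') as f:
        w = csv.writer(f)
        w.writerow(header)
        w.writerows(rows)
```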

Since you requested a specific solution that partitions a potentially large CSV file into two files for training and test data, I'll also show how that can be done using an approach similar to the general method described above:

import random

# Count the non-empty data lines, excluding the header
with open('data.csv', 'r') as csvf:
    next(csvf)  # skip the header
    linecount = sum(1 for line in csvf if line.strip() != '')

# Create the index set for the test data (15% of the rows)
indices = list(range(linecount))
random.shuffle(indices)
ind_test = set(indices[:int(linecount * 0.15)])
del indices

# Partition the CSV file, writing the header to both output files
with open('data.csv', 'r') as csvf, open('train.csv', 'w') as trainf, open('test.csv', 'w') as testf:
    header = next(csvf)
    trainf.write(header)
    testf.write(header)
    i = 0
    for line in csvf:
        if line.strip() != '':
            if i in ind_test:
                testf.write(line.strip() + '\n')
            else:
                trainf.write(line.strip() + '\n')
            i += 1

Here, I assume that the CSV file contains one observation per row.

This creates an exact 85:15 split. If a less exact partition is acceptable, the solution of Peter Wood would be much more efficient.
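As a small variation on the index-set step above, the test indices can be drawn directly with random.sample, which avoids building and shuffling the full index list. A sketch — linecount stands for the number of data rows, and the value here is just a placeholder:

```python
import random

linecount = 1000  # e.g. the number of data rows counted in the first pass

# Draw 15% of the row indices, without replacement, as the test set
ind_test = set(random.sample(range(linecount), int(linecount * 0.15)))
```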

Use the random function in the random module to get a uniformly distributed random number between 0 and 1.

If it's greater than 0.85, write the line to the test data; otherwise, to the training data. That gives roughly 85% training and 15% test rows. See How do I simulate flip of biased coin in python?

import random

with open(input_file) as data, \
     open(test_output, 'w') as test, \
     open(train_output, 'w') as train:
    header = next(data)
    test.write(header)
    train.write(header)
    for line in data:
        if random.random() > 0.85:
            test.write(line)
        else:
            train.write(line)
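The coin-flip approach only approximates the 85:15 ratio; the proportions converge as the number of rows grows. A quick sanity check of the biased coin itself, counting draws of random.random() at or below 0.85 as training rows (the fixed seed is just for reproducibility):

```python
import random

random.seed(0)  # fixed seed so the run is reproducible
n = 100_000
train_count = sum(1 for _ in range(n) if random.random() <= 0.85)
share = train_count / n
print(share)  # close to 0.85, but rarely exact
```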
