Python: How to sample data into Test and Train datasets?
I have been working with a CSV file in my scripts and want to sample the data into two datasets: an 85% and a 15% division, written out as two CSV files, Train.csv and Test.csv. I want to do this in base Python, without any external module such as NumPy, SciPy, Pandas, or scikit-learn. Can anyone help me with random sampling of data by percentage? Moreover, the datasets I will be given may have an arbitrary number of observations. So far I have only read about Pandas and various other modules for sampling data on a percentage basis, and have not found a concrete solution to my problem.
Moreover, I want to retain the header of the CSV in both files, because the header makes each row accessible and can be used in further analysis.
Use random.shuffle to create a random permutation of your dataset and slice it as you wish:
import random
random.shuffle(data)
train = data[:int(len(data)*0.85)]
test = data[len(train):]
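For completeness, the shuffle-and-slice idea above can be sketched end to end, including reading the CSV and writing the header to both output files as the question requires. This is an illustrative sketch: the file names data.csv, Train.csv and Test.csv are assumptions, and a small example input is generated first so the sketch is self-contained.

```python
import csv
import random

# Create a small example input so the sketch is self-contained
# (in practice you would already have data.csv).
with open('data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['id', 'value'])                  # header row
    writer.writerows([i, i * 2] for i in range(100))  # 100 observations

# Read the header and the observations separately
with open('data.csv', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)
    data = list(reader)

# Shuffle and slice 85% / 15%
random.shuffle(data)
split = int(len(data) * 0.85)
train, test = data[:split], data[split:]

# Write both files with the header retained in each
for path, rows in (('Train.csv', train), ('Test.csv', test)):
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
```

Note that this loads the whole dataset into memory, which is fine for moderate file sizes but not for very large CSVs.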
Since you requested a specific solution to partition a potentially large CSV file into two files for training and test data, I'll also show how that can be done with an approach similar to the general method described above:
import random

# Count non-empty lines
with open('data.csv', 'r') as csvf:
    linecount = sum(1 for line in csvf if line.strip() != '')

# Create the index set for the test data (15% of the rows)
indices = list(range(linecount))
random.shuffle(indices)
ind_test = set(indices[:int(linecount * 0.15)])
del indices

# Partition the CSV file
with open('data.csv', 'r') as csvf, open('train.csv', 'w') as trainf, open('test.csv', 'w') as testf:
    i = 0
    for line in csvf:
        if line.strip() != '':
            if i in ind_test:
                testf.write(line.strip() + '\n')
            else:
                trainf.write(line.strip() + '\n')
            i += 1
Here I assume that the CSV file contains one observation per row.
This will create an exact 85:15 split. If a less exact partition is okay for you, the solution of Peter Wood would be much more efficient.
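As an aside (not part of the answer above), the test index set could also be drawn directly with random.sample, which avoids building and shuffling a full index list. A minimal sketch, assuming linecount has already been determined by the line-counting pass:

```python
import random

linecount = 1000  # assumed to come from the line-counting pass
# Draw 15% of the row indices directly, without shuffling a full list
ind_test = set(random.sample(range(linecount), int(linecount * 0.15)))
```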
Use the random function in the random module to get a uniformly distributed random number between 0 and 1. If it's less than 0.85, write the line to the training data, otherwise to the test data. See How do I simulate flip of biased coin in python?.
import random

with open(input_file) as data, open(test_output, 'w') as test, open(train_output, 'w') as train:
    # Copy the header row to both output files
    header = next(data)
    test.write(header)
    train.write(header)
    for line in data:
        if random.random() < 0.85:
            train.write(line)
        else:
            test.write(line)
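As a quick sanity check (an illustrative sketch, not part of the answer above), simulating the biased coin shows that random.random() < 0.85 sends roughly 85% of the lines to the training file, though the split is only approximate:

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable
n = 100_000
# Count how often the "coin" lands on the training side
train_count = sum(1 for _ in range(n) if random.random() < 0.85)
print(train_count / n)  # close to 0.85, but not exact
```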