
Splitting folders into training and testing sets

I have 5 folders of the Enron email dataset. In Python, I want to use enron1, enron3, and enron5 as the training set and enron2 and enron4 as the testing set. I can load the full dataset and split it randomly, but I can't work out how to split it by folder as described.

from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split

X, y = [], []
for i in range(1, 6):
    # folder containing the 2 categories of documents in individual folders.
    movie_data = load_files(f"/Users/mehedihasan/Desktop/Study/SEM6/COMP723 Data Mining & Knowledge Engineering/Assignment/email data/enron{i}")
    X.extend(movie_data.data)    # raw file contents
    y.extend(movie_data.target)  # integer category label per file

# random 70/30 split over all five folders combined
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

Maybe use for i in [1, 3, 5]: and for i in [2, 4]: instead of for i in range(1, 6):

for i in [1,3,5]:
    # ... code ..
    X_train = ...
    y_train = ...

for i in [2, 4]:
    # ... code ..
    X_test = ...
    y_test = ...
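The two loops above could be filled in roughly as follows. This is only a sketch under some assumptions: the helper names load_folders and split_by_folder and the base_dir argument are hypothetical, the folders are assumed to be named enron1 ... enron5 directly under base_dir, and every folder is assumed to contain the same category subfolders, so that load_files assigns the same integer label to the same category each time.

```python
from pathlib import Path

from sklearn.datasets import load_files


def load_folders(base_dir, indices):
    """Load enron{i} for each i in indices and concatenate texts and labels.

    base_dir is a hypothetical parent directory holding enron1, enron2, ...
    """
    X, y = [], []
    for i in indices:
        # load_files treats each subfolder of enron{i} as one category
        bunch = load_files(str(Path(base_dir) / f"enron{i}"))
        X.extend(bunch.data)    # raw file contents (bytes by default)
        y.extend(bunch.target)  # integer category label per file
    return X, y


def split_by_folder(base_dir, train_ids=(1, 3, 5), test_ids=(2, 4)):
    """Build the train/test sets from whole folders instead of a random split."""
    X_train, y_train = load_folders(base_dir, train_ids)
    X_test, y_test = load_folders(base_dir, test_ids)
    return X_train, X_test, y_train, y_test
```

Calling split_by_folder with the directory that contains the enron folders returns the four lists in the same order as train_test_split, so the rest of a pipeline should not need to change. No email from enron2 or enron4 can end up in the training set this way.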

BTW:

If you have more folders, you can use

  • range(1, n, 2) to get 1, 3, 5, 7, 9, ...
  • range(2, n, 2) to get 2, 4, 6, 8, 10, ...
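A small illustration of the two ranges, assuming a hypothetical n = 10 folders named enron1 ... enron10. Since the stop value of range is exclusive, the sketch passes n + 1 so that folder n itself is included.

```python
n = 10  # hypothetical number of folders: enron1 ... enron10

# range() excludes its stop value, so use n + 1 to include folder n
train_ids = list(range(1, n + 1, 2))  # odd-numbered folders
test_ids = list(range(2, n + 1, 2))   # even-numbered folders

print(train_ids)  # [1, 3, 5, 7, 9]
print(test_ids)   # [2, 4, 6, 8, 10]
```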
