TypeError: Level type mismatch: 0.2. When splitting data into training, validating and testing sets
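Good day,
I am trying to split data into training, validation and test sets without using scikit-learn. I would like to split the data as in the following example:
However, when I try to split the data, I get the following error: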
TypeError: Level type mismatch: 6.0
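I need help understanding what I am doing wrong here. The sample data and the target data are x_data, which is a DataFrame, and y_data, which is a pandas Series, respectively. Here is the code I tried: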
def train_valid_test(x_data y_data, train_split, valid_split, test_split):
    """ Parameters
    x_data: the input data
    y_data: target values
    train_split: the portion used for training data
    valid_split: the portion used for validating data
    test_split: the portion used for testing data
    """
    # setting sizes to split the data into training validating and testing samples accordingly
    train_size = float(len(all_x)*0.7)
    valid_size = float(len(all_x)*0.2)
    test_size = float(len(x_prime)*0.1)
    # Creating Training and Validation sets
    x_train, x_prime = x_data[:valid_size], x_data[valid_size:]
    y_train, y_prime = y_data[:valid_size], y_data[valid_size:]
    # Creating test sets
    x_valid, x_test = x_prime[:test_size], x_prime[test_size:]
    y_valid, y_test = y_prime[:test_size], y_prime[test_size:]
    # Return the samples
    return X_train, X_valid, X_test, y_train, y_valid, y_test
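You are trying to slice a pandas DataFrame with a float, because the following operations produce non-integer values for the sizes of the training, validation and test data: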
train_size = float(len(all_x)*0.7)
valid_size = float(len(all_x)*0.2)
test_size = float(len(x_prime)*0.1)
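A minimal sketch of the difference, assuming a small made-up DataFrame (the `feature` column and the variable names are illustrative, not from the question). It uses `.iloc` for positional slicing, which rejects a float bound with a TypeError and accepts the same value once converted with `int`:

```python
import pandas as pd

# Toy data standing in for x_data; the column name is hypothetical.
df = pd.DataFrame({"feature": range(10)})

float_size = float(len(df) * 0.7)  # a float, like the sizes in the question
int_size = int(len(df) * 0.7)      # the same size truncated to an integer

# Positional slicing with a float bound is rejected by pandas.
try:
    df.iloc[:float_size]
    float_slice_failed = False
except TypeError:
    float_slice_failed = True

# The integer bound slices as expected.
train = df.iloc[:int_size]
```

The exact wording of the error depends on the pandas version and on the index type, so it may not match the message shown in the question.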
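Note that your split is incorrect: the training set contains all the data points of the validation and test sets, and the validation set contains all the instances of the test set. Also, you should never rely on a split that does not shuffle the dataset. The following function will solve your problem.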
import numpy as np
import pandas as pd

def train_valid_test(df, train_split=.7, valid_split=.2, seed=None):
    # Shuffle the index labels so the split does not depend on row order
    np.random.seed(seed)
    perm = np.random.permutation(df.index)
    # Integer cut points; the test set takes the rows after validate_max_index
    training_max_index = int(train_split * len(df.index))
    validate_max_index = int(valid_split * len(df.index)) + training_max_index
    # .loc replaces .ix, which was removed in pandas 1.0
    training = df.loc[perm[:training_max_index]]
    validation = df.loc[perm[training_max_index:validate_max_index]]
    test = df.loc[perm[validate_max_index:]]
    return training, validation, test
If you want to pass the dependent variable (y) and the independent variables (x) separately, you can use the following function:
import numpy as np
import pandas as pd

def train_valid_test(x_data, y_data, train_split=.7, valid_split=.2, seed=None):
    if len(x_data.index) != len(y_data.index):
        raise Exception('x_data and y_data must contain the same number of data points')
    # Shuffle both objects with the same permutation of index labels
    np.random.seed(seed)
    perm = np.random.permutation(x_data.index)
    x_data = x_data.reindex(perm)
    y_data = y_data.reindex(perm)
    training_max_index = int(train_split * len(x_data.index))
    validate_max_index = int(valid_split * len(x_data.index)) + training_max_index
    # Positional slices; the validation set starts where the training set ends
    X_train, y_train = x_data.iloc[:training_max_index], y_data.iloc[:training_max_index]
    X_valid, y_valid = x_data.iloc[training_max_index:validate_max_index], y_data.iloc[training_max_index:validate_max_index]
    X_test, y_test = x_data.iloc[validate_max_index:], y_data.iloc[validate_max_index:]
    return X_train, X_valid, X_test, y_train, y_valid, y_test
Slice indices must be integers. You could try:
train_size = int(len(all_x)*0.7)
valid_size = int(len(all_x)*0.2)
test_size = int(len(x_prime)*0.1)
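As a rough sketch of how integer sizes turn into non-overlapping slices, the helper below computes two cut points and lets the test set take whatever rows remain. The name `split_bounds` and the plain list standing in for the data are assumptions made for illustration; they are not part of the question's code.

```python
def split_bounds(n_rows, train_split=0.7, valid_split=0.2):
    """Return integer (train_end, valid_end) cut points for n_rows rows.

    Hypothetical helper for illustration. The test set is everything
    after valid_end, so the three parts always cover all n_rows rows.
    """
    train_end = int(n_rows * train_split)
    valid_end = train_end + int(n_rows * valid_split)
    return train_end, valid_end


rows = list(range(20))  # stand-in for the rows of x_data
train_end, valid_end = split_bounds(len(rows))

# Each slice starts where the previous one ended, so no row is shared.
train = rows[:train_end]
valid = rows[train_end:valid_end]
test = rows[valid_end:]
```

Because `int` truncates, the training and validation parts can come out slightly smaller than the requested fractions; the remainder lands in the test set.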