Randomly spread a dataset in Python
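I want to randomly split a dataset into three sets in the proportions 60%, 20%, 20%. I have coded something, but the problem is that it can randomly pick the same row twice, so the sets overlap. Code: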
mask_60 = np.random.choice([False, True], len(ds2), p=[0.4,0.6])
mask_20 = np.random.choice([False, True], len(ds2), p=[0.8,0.2])
ds2_train = ds2[mask_60]
ds2_test = ds2[mask_20]
ds2_val = ds2[mask_20]
Any suggestions?
Thanks!
Use sklearn's train_test_split (see its documentation). First split the dataset into 60% and 40%, then split the 40% in half.
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
set_1, temp = train_test_split(ds2, train_size=0.6, random_state=42)
set_2, set_3 = train_test_split(temp, train_size=0.5, random_state=42)
You can also specify a seed value (random_state) to make your samples reproducible.
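As a minimal, self-contained sketch of the two-step split described in this answer, the snippet below checks that the three parts are disjoint and sized roughly 60/20/20. The DataFrame is synthetic stand-in data rather than the asker's ds2, and the helper name split_60_20_20 is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_60_20_20(df, seed=42):
    """Hypothetical helper: split df into 60% / 20% / 20% without overlap."""
    # First cut: 60% for the first set, 40% held back.
    set_1, temp = train_test_split(df, train_size=0.6, random_state=seed)
    # Second cut: halve the held-back 40% into two 20% sets.
    set_2, set_3 = train_test_split(temp, train_size=0.5, random_state=seed)
    return set_1, set_2, set_3


# Synthetic stand-in for ds2 (assumption: a DataFrame with a unique index).
demo = pd.DataFrame({"x": np.arange(1000), "y": np.arange(1000) % 5})
train, test, val = split_60_20_20(demo)

print(len(train), len(test), len(val))

# Every row should land in exactly one of the three sets.
all_idx = train.index.append([test.index, val.index])
print(all_idx.is_unique, len(all_idx) == len(demo))
```

Because each call partitions its input, no row can appear in two sets, which is what the independent boolean masks in the question cannot guarantee.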
Thanks, @Jonah
%reset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn.ensemble as sk
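from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.impute import SimpleImputer as Imputer  # sklearn.preprocessing.Imputer was removed in scikit-learn 0.22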
dataset=pd.read_csv('raw_data/data_set_nass_ges_person_2014.csv',sep=";");
ds=dataset
ds2 = ds[ds["INJ_SEV"] <=4 ] #to clean useless variables from 5 to 9 , MAXSEV_IM
train, temp = train_test_split(ds2, train_size = 0.8) #training set
test, val = train_test_split(temp, test_size=0.5) #test set, validation set
rfc = sk.RandomForestClassifier(n_estimators=500, oob_score=True)
train_data = train[train.columns[1:-1]] #input
train_truth = train["INJ_SEV"] #target
train_data = Imputer().fit_transform(train_data)
train_truth = Imputer().fit_transform(train_truth) #to solve the problem of 32bit vs 64bit
model = rfc.fit(train_data, train_truth) # Here appears the problem
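The error message is not shown, so the following is only a guess at the cause: imputers operate on 2-D feature matrices, whereas the target INJ_SEV is a 1-D column, and passing it through an imputer can change its shape or raise an error. A minimal sketch, using synthetic stand-in data (every column name except INJ_SEV is hypothetical) and the current SimpleImputer class, imputes only the features and passes the target unchanged:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; only the INJ_SEV name comes from the post.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feat_a": rng.normal(size=200),
    "feat_b": rng.normal(size=200),
    "INJ_SEV": rng.integers(0, 5, size=200),
})
df.loc[::10, "feat_a"] = np.nan  # inject some missing feature values

train, temp = train_test_split(df, train_size=0.6, random_state=0)
test, val = train_test_split(temp, test_size=0.5, random_state=0)

feature_cols = ["feat_a", "feat_b"]
imputer = SimpleImputer(strategy="mean")
X_train = imputer.fit_transform(train[feature_cols])  # 2-D array, NaNs filled
y_train = train["INJ_SEV"].to_numpy()                 # 1-D target, not imputed

rfc = RandomForestClassifier(n_estimators=50, oob_score=True, random_state=0)
model = rfc.fit(X_train, y_train)

# Reuse the imputer fitted on the training set for the held-out data.
X_test = imputer.transform(test[feature_cols])
print(model.score(X_test, test["INJ_SEV"].to_numpy()))
```

Fitting the imputer on the training set only and reusing it for the test and validation sets keeps information from the held-out rows out of the model.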