
Randomly spread a dataset in Python

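I want to randomly split a dataset into three subsets in a 60% / 20% / 20% ratio. I have written some code, but the problem is that it can randomly pick the same value twice. Code: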

mask_60 = np.random.choice([False, True], len(ds2), p=[0.4,0.6])
mask_20 = np.random.choice([False, True], len(ds2), p=[0.8,0.2])

ds2_train = ds2[mask_60]
ds2_test = ds2[mask_20]
ds2_val = ds2[mask_20]

Any suggestions?

Thanks!

Use sklearn's train_test_split (see its documentation). First split the dataset into 60% and 40%, then split the 40% in half.

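from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20

# 60% goes to the first set, the remaining 40% is held back
set_1, temp = train_test_split(ds2, train_size=0.6, random_state=42)
# the held-back 40% is cut in half: 20% + 20%
set_2, set_3 = train_test_split(temp, train_size=0.5, random_state=42)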
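To check that the two-step split really gives disjoint 60/20/20 subsets, here is a minimal sketch on a toy DataFrame. The name `toy` and its 100 rows are assumptions that stand in for `ds2`; only `train_test_split` comes from the answer.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for ds2: 100 rows with a unique index.
toy = pd.DataFrame({"value": range(100)})

# Step 1: 60% goes to the first set, the remaining 40% is held back.
set_1, temp = train_test_split(toy, train_size=0.6, random_state=42)
# Step 2: the held-back 40% is cut in half, giving 20% + 20%.
set_2, set_3 = train_test_split(temp, train_size=0.5, random_state=42)

sizes = (len(set_1), len(set_2), len(set_3))

# Every row should land in exactly one subset:
# no index appears twice and none is lost.
all_indices = list(set_1.index) + list(set_2.index) + list(set_3.index)
no_overlap = len(set(all_indices)) == len(all_indices) == len(toy)

print(sizes, no_overlap)
```

Because each row is assigned once in step 1 and the leftover rows once in step 2, the duplicate picks from the independent masks in the question cannot occur.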

You can also specify a seed value to make your samples reproducible.

Thanks, @Jonah
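The seed idea also carries over if you prefer to stay with plain NumPy, as in the question. The sketch below swaps in a different technique from the answer: it shuffles the row positions once with a seeded generator and cuts the shuffled order at 60% and 80%. The toy `ds2` and the helper name `split_positions` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real ds2.
ds2 = pd.DataFrame({"value": np.arange(100)})

def split_positions(n_rows, seed):
    """Shuffle row positions once, then cut at 60% and 80%."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    cut_60 = int(0.6 * n_rows)
    cut_80 = int(0.8 * n_rows)
    return np.split(order, [cut_60, cut_80])

train_idx, test_idx, val_idx = split_positions(len(ds2), seed=42)

ds2_train = ds2.iloc[train_idx]
ds2_test = ds2.iloc[test_idx]
ds2_val = ds2.iloc[val_idx]

# The same seed reproduces the same assignment of rows.
train_again, _, _ = split_positions(len(ds2), seed=42)
reproducible = np.array_equal(train_idx, train_again)
```

Since every position occurs exactly once in the permutation, the three subsets cannot share rows, unlike two masks drawn independently.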

# IPython magic: clears the interactive namespace
%reset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn.ensemble as sk
from sklearn.model_selection import train_test_split  # replaces the removed sklearn.cross_validation
from sklearn.impute import SimpleImputer  # replaces the removed sklearn.preprocessing.Imputer

dataset = pd.read_csv('raw_data/data_set_nass_ges_person_2014.csv', sep=";")
ds = dataset

ds2 = ds[ds["INJ_SEV"] <= 4]  # to clean useless values from 5 to 9, MAXSEV_IM

train, temp = train_test_split(ds2, train_size=0.8)  # training set

test, val = train_test_split(temp, test_size=0.5)  # test set, validation set

rfc = sk.RandomForestClassifier(n_estimators=500, oob_score=True)

train_data = train[train.columns[1:-1]]  # input
train_truth = train["INJ_SEV"]  # target

train_data = SimpleImputer().fit_transform(train_data)  # fill missing inputs (to solve the problem of 32bit vs 64bit)
train_truth = train_truth.to_numpy()  # keep the target 1-D; rows with a missing INJ_SEV were already dropped by the filter above

model = rfc.fit(train_data, train_truth)  # the problem appeared here
