
How to create stratified training, validation, and test sets in PySpark?

I have a small dataset (140K rows) that I would like to split into a training set, a validation set, and a test set, using the target variable and another field to stratify those splits.

In PySpark you can use the randomSplit() function to divide a dataset into train, validation, and test parts. It takes up to two arguments: weights and seed. We pass a seed so that the same split is reproduced on every run. weights is a list of floating-point numbers that specifies what fraction of the data goes into each part; if the weights do not sum to 1, they are normalized.

Sample Code

train, validation, test = data.randomSplit([0.8, 0.1, 0.1], seed=785)
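
Note that randomSplit() alone does not stratify: it assigns rows at random regardless of the target distribution. Below is a minimal sketch of an approximately stratified three-way split using DataFrame.sampleBy(), assuming a DataFrame named data with columns label and other_field (both hypothetical names standing in for your target and your second stratification field):

from pyspark.sql import functions as F

# Hypothetical: build a single stratification key from the target and the second field.
data = data.withColumn("strata", F.concat_ws("_", "label", "other_field"))

# Sample roughly 80% of each stratum for training.
strata = [row["strata"] for row in data.select("strata").distinct().collect()]
train = data.sampleBy("strata", fractions={s: 0.8 for s in strata}, seed=785)

# With a fixed seed, sampleBy is deterministic, so the multiset difference is safe.
# exceptAll keeps duplicate rows, unlike subtract. Split the remaining ~20%
# in half per stratum to get ~10% validation and ~10% test.
rest = data.exceptAll(train)
validation = rest.sampleBy("strata", fractions={s: 0.5 for s in strata}, seed=785)
test = rest.exceptAll(validation)

Because sampleBy() draws a Bernoulli sample per stratum, the resulting fractions are approximate rather than exact, which is usually acceptable at 140K rows.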
