
How to create stratified training, validation, and test sets in PySpark?

I have a small dataset (140K rows) that I would like to split into a training set, a validation set, and a test set, using the target variable and another field to stratify those splits.

In PySpark you can use the randomSplit() function to divide a dataset into train, validation, and test parts. It takes up to two arguments: weights and seed. We pass a seed so that the same split is reproduced on every run. weights is a list of floating-point numbers that specifies what fraction of the data goes into each part; if the weights do not sum to 1, they are normalized.

Sample Code

train, validation, test = data.randomSplit([0.8, 0.1, 0.1], seed=785)
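
Note that randomSplit() alone does not stratify: it assigns rows at random regardless of the target distribution. Below is a minimal sketch of an approximately stratified three-way split using DataFrame.sampleBy(), assuming a DataFrame named data with columns label and other_field (both hypothetical names standing in for your target and your second stratification field):

from pyspark.sql import functions as F

# Hypothetical: build a single stratification key from the target and the second field.
data = data.withColumn("strata", F.concat_ws("_", "label", "other_field"))

# Sample roughly 80% of each stratum for training.
strata = [row["strata"] for row in data.select("strata").distinct().collect()]
train = data.sampleBy("strata", fractions={s: 0.8 for s in strata}, seed=785)

# With a fixed seed, sampleBy is deterministic, so the multiset difference is safe.
# exceptAll keeps duplicate rows, unlike subtract. Split the remaining ~20%
# in half per stratum to get ~10% validation and ~10% test.
rest = data.exceptAll(train)
validation = rest.sampleBy("strata", fractions={s: 0.5 for s in strata}, seed=785)
test = rest.exceptAll(validation)

Because sampleBy() draws a Bernoulli sample per stratum, the resulting fractions are approximate rather than exact, which is usually acceptable at 140K rows.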
