ValueError when using scikit-learn train_test_split function in PySpark Pandas UDF

Question

I want to create a pandas UDF for PySpark that uses the scikit-learn train_test_split function and returns a dataframe.

My dataframe originally had no id column, so I added one.

This is what I have done.

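import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from sklearn.model_selection import train_test_split

# schema, X_columns and y_columns are defined elsewhere
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def load_dataset(df):
    # Called once per group, so df holds only that group's rows
    X = df[X_columns]
    y = df[y_columns]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Return the test split with the target column(s) first
    df_sample_0 = pd.concat([y_test, X_test], axis=1)

    return df_sample_0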

And this is how I am applying groupby:

sample_df = final_df_spark.groupby("id").apply(load_dataset)

But I am getting this error:

ValueError: With n_samples=1, test_size=0.2 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters.

How would I go about fixing this error?

Answer 1

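I just replaced the id column with the Age column in the groupby. The added id is presumably unique per row, so grouping by id passes the UDF one row at a time (hence n_samples=1 in the error), and train_test_split cannot split a single row into a non-empty train set and test set. Grouping by Age gives groups with more than one row.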

sample_df = final_df_spark.groupby("Age").apply(load_dataset)
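Note that grouping by Age can still raise the same error if some Age value occurs in only one row. A minimal pandas-only sketch of a guard for that case is below. It is not tested on Spark, and the column names Fare and Survived, the sample data, and the helper name sample_test_rows are hypothetical, chosen only for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical columns; the question does not show the real ones.
X_columns = ["Age", "Fare"]
y_columns = ["Survived"]


def sample_test_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Return the test split of one group, or no rows if the group is too small."""
    if len(df) < 2:
        # A single row cannot yield both a train and a test row,
        # so return an empty frame with the same columns instead of raising.
        return df.iloc[0:0][y_columns + X_columns]
    X = df[X_columns]
    y = df[y_columns]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    return pd.concat([y_test, X_test], axis=1)


# Made-up data: five rows share Age 22, one row has Age 35.
pdf = pd.DataFrame({
    "id": range(6),
    "Age": [22, 22, 22, 22, 22, 35],
    "Fare": [7.25, 8.05, 7.9, 13.0, 26.0, 53.1],
    "Survived": [0, 1, 0, 1, 0, 1],
})

# Every id group holds exactly one row, which is what triggers the ValueError.
largest_id_group = pdf.groupby("id").size().max()

# Grouping by Age works for the 5-row group; the 1-row group is skipped.
by_age = pd.concat([sample_test_rows(g) for _, g in pdf.groupby("Age")])
print(largest_id_group)
print(by_age)
```

The same length check could be placed at the top of load_dataset, assuming the empty frame it returns still matches the declared schema.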

solution1
0 ACCEPTED 2021-02-02 08:32:43