
ValueError when using scikit-learn train_test_split function in PySpark Pandas UDF

I want to create a pandas UDF for PySpark that uses the scikit-learn train_test_split function and returns a dataframe.

My dataframe looks like this: [image]. It originally had no id column, so I added one.

This is what I have done.

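import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from sklearn.model_selection import train_test_split

# schema, X_columns and y_columns are defined elsewhere (not shown in the post)
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def load_dataset(df):
    # df holds the rows of a single group
    X = df[X_columns]
    y = df[y_columns]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Return only the test split, with the label columns first
    df_sample_0 = pd.concat([y_test, X_test], axis=1)

    return df_sample_0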

And this is how I am applying groupby:

sample_df = final_df_spark.groupby("id").apply(load_dataset)

But I am getting this error:

ValueError: With n_samples=1, test_size=0.2 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters.

How would I go about fixing this error?
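The numbers in the error message can be checked by hand. The snippet below is only a simplified sketch of the size arithmetic for a float test_size, not scikit-learn's actual code; split_sizes is a hypothetical helper name. It assumes the test count is rounded up and the remaining rows go to the train set, which is enough to show why a one-row group leaves nothing to train on.

```python
import math

def split_sizes(n_samples, test_size):
    """Approximate train/test row counts for a float test_size.

    Assumption: the test count is rounded up and the rest of the
    rows form the train set (train_size=None).
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

for n in (1, 2, 5, 10):
    n_train, n_test = split_sizes(n, 0.2)
    status = "empty train set -> ValueError" if n_train == 0 else "ok"
    print(f"n_samples={n}: train={n_train}, test={n_test} ({status})")
```

With test_size=0.2, a group needs at least two rows before both splits are non-empty.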

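I just replaced the id column with the Age column in the groupby. Since the added id is presumably unique per row, each group held a single row (the n_samples=1 in the error), so there was nothing to split.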

sample_df = final_df_spark.groupby("Age").apply(load_dataset)
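Before choosing a grouping column, it may help to check how many rows each group would contain. The following is a minimal plain-Python sketch on made-up rows; the data and the small_groups helper are hypothetical and only illustrate the idea of counting rows per key.

```python
from collections import Counter

# Hypothetical rows: "id" is unique per row, "Age" repeats across rows.
rows = [
    {"id": 1, "Age": 30},
    {"id": 2, "Age": 30},
    {"id": 3, "Age": 30},
    {"id": 4, "Age": 45},
    {"id": 5, "Age": 45},
]

def small_groups(rows, key, min_rows=2):
    """Return {group value: row count} for groups with fewer than min_rows rows."""
    counts = Counter(row[key] for row in rows)
    return {value: n for value, n in counts.items() if n < min_rows}

print(small_groups(rows, "id"))   # every id group is too small to split
print(small_groups(rows, "Age"))  # no Age group is too small in this sample
```

Note that a column such as Age can still contain values that occur only once in real data, in which case the same ValueError would come back for those groups.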
