I want to create a pandas UDF for PySpark that uses the scikit-learn train_test_split function and returns a DataFrame.
My DataFrame originally had no id column, so I added one.
This is what I have done.
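import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from sklearn.model_selection import train_test_split

# schema, X_columns and y_columns are assumed to be defined earlier
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def load_dataset(df):
    X = df[X_columns]
    y = df[y_columns]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    # Return only the test split, with the target column(s) first
    df_sample_0 = pd.concat([y_test, X_test], axis=1)
    return df_sample_0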
And this is how I am applying groupby:
sample_df = final_df_spark.groupby("id").apply(load_dataset)
But I am getting this error:
ValueError: With n_samples=1, test_size=0.2 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters.
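For reference, the same error can apparently be reproduced without Spark by passing a single-row frame to train_test_split, which is what each group looks like when the grouping key is unique per row. This is only a minimal sketch; the column names (Age, Salary, Purchased) and values are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical single-row group, as produced by grouping on a unique id.
group = pd.DataFrame({"Age": [25], "Salary": [50000], "Purchased": [1]})
X = group[["Age", "Salary"]]
y = group[["Purchased"]]

try:
    train_test_split(X, y, test_size=0.2, random_state=42)
    error_message = None
except ValueError as exc:
    # One row cannot be divided into a non-empty train and test set.
    error_message = str(exc)

print(error_message)
```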
How would I go about fixing this error?
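I fixed it by grouping on the Age column instead of the id column. Since the id values are presumably unique per row, every group held a single row, and train_test_split cannot split one sample into a train set and a test set.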
sample_df = final_df_spark.groupby("Age").apply(load_dataset)
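Grouping on a column whose groups hold several rows avoids the one-row split, but a rare value can still leave a group with a single row. The sketch below, in plain pandas rather than Spark, shows one possible guard that skips such groups. The column lists, the sample data, and the helper name sample_group are hypothetical, and the len(df) < 2 check is only sufficient for test_size=0.2.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

X_columns = ["Salary"]     # hypothetical feature columns
y_columns = ["Purchased"]  # hypothetical target column

def sample_group(df, test_size=0.2, random_state=42):
    """Return the test split of one group, or an empty frame if it is too small."""
    out_columns = y_columns + X_columns
    if len(df) < 2:
        # A single row cannot be split; return zero rows with the same columns.
        return df[out_columns].iloc[0:0]
    X_train, X_test, y_train, y_test = train_test_split(
        df[X_columns], df[y_columns], test_size=test_size, random_state=random_state
    )
    return pd.concat([y_test, X_test], axis=1)

# Made-up data: five rows with Age 25 and one row with Age 30.
data = pd.DataFrame({
    "Age": [25, 25, 25, 25, 25, 30],
    "Salary": [10, 20, 30, 40, 50, 60],
    "Purchased": [0, 1, 0, 1, 0, 1],
})

# Apply the function per group, roughly as a grouped-map UDF would.
parts = [sample_group(g) for _, g in data.groupby("Age")]
sample = pd.concat([p for p in parts if not p.empty])
print(sample)
```

The body of sample_group could in principle be reused inside load_dataset, provided the returned columns still match the declared schema.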