
Standard scaling is taking too much time on a PySpark dataframe

I've tried the standard scaler from spark.ml with the following function:

def standard_scale_2(df, columns_to_scale):
    """
    Args:
        df : Spark dataframe
        columns_to_scale : list of columns to standard scale
    """
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType
    from pyspark.ml.feature import VectorAssembler, StandardScaler
    from pyspark.ml import Pipeline

    # UDF for converting the one-element vector back to a double
    unlist = udf(lambda x: round(float(list(x)[0]), 3), DoubleType())

    # Iterating over columns to be scaled
    for i in columns_to_scale:

        # VectorAssembler transformation - converting the column to vector type
        assembler = VectorAssembler(inputCols=[i], outputCol=i + "_Vect")

        # StandardScaler transformation
        scaler = StandardScaler(inputCol=i + "_Vect", outputCol=i + "_Scaled")

        # Pipeline of VectorAssembler and StandardScaler
        pipeline = Pipeline(stages=[assembler, scaler])

        # Fitting the pipeline on the dataframe
        df = (pipeline.fit(df).transform(df)
              .withColumn(i + "_Scaled", unlist(i + "_Scaled"))
              .drop(i + "_Vect", i)
              .withColumnRenamed(i + "_Scaled", i))
    return df
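
A minimal usage sketch (assuming an active SparkSession named spark; the toy dataframe is just for illustration):

# Toy example: two numeric columns scaled in place.
df = spark.createDataFrame([(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)], ["a", "b"])
df = standard_scale_2(df, ["a", "b"])
df.show()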

Instead of iterating over each column, I've also tried scaling all the columns at once, but that didn't work either.

I've also tried standard scaling with this simple per-column expression:

from pyspark.sql import functions as F
from pyspark.sql.functions import stddev_samp

for column in columns_to_standard_scale:
    sdf = sdf.withColumn(column,
                         F.col(column) / sdf.agg(stddev_samp(column)).first()[0])
    print(column, "completed")

I'm using a Spark cluster with c5d.2xlarge nodes (16 GB memory, 8 cores; up to 30 nodes) in Databricks.
The Spark dataframe has only about 100k rows, and there are around 90 columns I need to scale. It's taking around 10 minutes per column, and when I tried to scale all the columns in one go, the script didn't complete even after 2 hours. The same dataframe in pandas takes barely 2 minutes with sklearn's StandardScaler.

I don't think there is any issue with the code or the dataframe, but I'm missing something that is creating a bottleneck and making this simple operation take far too long.

I came across a similar problem when I tried to build a pipeline for column scaling. My dataset had 400 features, and at first I thought of adding a separate scaler stage for each one:

stages = []
for col_to_scale in scallarInputs:
    col_scaler = StandardScaler(inputCol=col_to_scale,
                                outputCol=col_to_scale + "_scaled",
                                withStd=True, withMean=withMean)
    stages += [col_scaler]

pipeline = Pipeline(stages=stages)
pipelineModel = pipeline.fit(df)

For my dataset, it took six hours to run!

Then I decided to do the vector assembly first and then scale the assembled vector:

stages = []
assemblerInputs = df.columns
assemblerInputs = [column for column in assemblerInputs
                   if column not in columns_to_remove_from_assembler]

# Add the vector assembler
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features_nonscaled")
stages += [assembler]

col_scaler = StandardScaler(inputCol='features_nonscaled', outputCol='features',
                            withStd=True, withMean=False)
stages += [col_scaler]

pipeline = Pipeline(stages=stages)
assemblerModel = pipeline.fit(df)

Everything took 17 seconds!
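
If you need the scaled values back as ordinary double columns (as in the question's function), one way, assuming Spark 3.0+ and the same variable names as above, is to expand the features vector with vector_to_array. This is only a sketch, not part of my original pipeline:

from pyspark.sql import functions as F
from pyspark.ml.functions import vector_to_array  # available in Spark 3.0+

# Expand the scaled vector back into one double column per assembled input.
scaled = assemblerModel.transform(df)
arr = vector_to_array(F.col("features"))
scaled = scaled.select(
    "*",
    *[arr[i].alias(c + "_scaled") for i, c in enumerate(assemblerInputs)]
).drop("features_nonscaled", "features")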

I hope it's helpful to someone.

There wasn't any issue with the standard scaling Spark code. It was Spark's lazy evaluation, which I wasn't aware of earlier, that made me think something was wrong with this standard scaling function.
Lazy evaluation means Spark waits until the very last moment to execute the graph of computation instructions.
There was a back-filling function that I executed just before this standard scaler function, and it was the actual bottleneck: when I commented that part out, my Spark application ran fine. That back-filling function used cross joins, groupBy and other wide transformations, which were very inefficient because they caused a lot of shuffling. After I modified that function, my whole Spark application finished within 30 seconds.
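
To see that kind of hidden cost, one option is to materialize the dataframe before the scaling step so each stage's time is isolated. This is just a sketch; backfill is a stand-in name for the expensive upstream function, not the real one:

# Because of lazy evaluation, calling the scaler also re-triggers every
# upstream transformation (here, the back-filling step). Caching and
# forcing an action shows where the time actually goes.
df = backfill(df)                             # stand-in for the upstream step
df = df.cache()
df.count()                                    # forces the back-fill to run (and be cached) here
df = standard_scale_2(df, columns_to_scale)   # now this times only the scaling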
