
Why is my Code Repo warning me about using withColumn in a for/while loop?

I'm noticing my code repo is warning me that using withColumn in a for/while loop is an antipattern. Why is this not recommended? Isn't this a normal use of the PySpark API?

We've noticed in practice that using withColumn inside a for/while loop leads to poor query planning performance, as discussed here. This isn't obvious when writing code for the first time in Foundry, so we've built a feature to warn you about this behavior.
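You can see the effect yourself by looking at the query plan. Here's a minimal sketch (assuming a running SparkSession named spark, which isn't part of the original question) that builds up a plan in a loop and prints it:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2)], ["a", "b"])

# Each withColumn call wraps the previous plan in another projection.
for i in range(50):
    df = df.withColumn("col_{}".format(i), F.lit(i))

# The analyzed logical plan now contains dozens of nested projections; with
# enough columns the optimizer slows down or can even overflow the stack.
df.explain(extended=True)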

We'd recommend you follow the Scala docs' recommendation:

withColumn(colName: String, col: Column): DataFrame
Returns a new Dataset by adding a column or replacing the existing column that has the same name.

Since: 2.0.0

Note: this method introduces a projection internally. Therefore, calling it multiple times, for instance, via loops in order to add multiple columns can generate big plans which can cause performance issues and even StackOverflowException. To avoid this, use select with the multiple columns at once.

i.e.

from pyspark.sql import functions as F

my_other_columns = [...]

# A single select builds the whole projection in one step: keep every column
# not in my_other_columns, and add a suffixed copy of each column that is.
df = df.select(
    *[col_name for col_name in df.columns if col_name not in my_other_columns],
    *[F.col(col_name).alias(col_name + "_suffix") for col_name in my_other_columns]
)

is vastly preferred over

my_other_columns = [...]

# Each withColumn call adds another projection, so the plan grows
# with every iteration of the loop.
for col_name in my_other_columns:
    df = df.withColumn(
        col_name + "_suffix",
        F.col(col_name)
    )

While this may technically be a normal use of the PySpark API, calling withColumn too many times in a single job results in poor query planning performance, so we'd prefer you avoid the problem entirely.
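Separately, if your environment is on Spark 3.3 or later, DataFrame.withColumns accepts a dict and adds all of the columns in a single projection, which sidesteps the same problem. A minimal sketch, assuming the same df and my_other_columns as above:

from pyspark.sql import functions as F

my_other_columns = [...]

# withColumns (Spark 3.3+) adds every column in one projection,
# so the plan stays flat even for many columns.
df = df.withColumns(
    {col_name + "_suffix": F.col(col_name) for col_name in my_other_columns}
)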
