Trimming white space in only string columns of a DataFrame

Question

I'm trying to trim leading and trailing whitespace in any given DataFrame, but only in string columns (so as not to alter the DataFrame's schema). Another option would be to trim all columns and then infer or replace the schema afterwards, but I'm not sure how to do that either. This is what I'm doing now:

from pyspark.sql import functions as func
from pyspark.sql.functions import col

mmDF.printSchema()
columnList = [item[0] for item in mmDF.dtypes if item[1].startswith('string')]

mmDF = mmDF.withColumn(col, func.ltrim(func.rtrim(mmDF[col] for mmDF_col in columnList)))

mmDF.show()

mmDF.printSchema()

The trimming line raises this error:

TypeError: Invalid argument, not a string or column: <generator object <genexpr> at 0x0000027D5C63E248> of type <class 'generator'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.

Answer 1

The answer is found here. Essentially, you select the string columns with the pandas select_dtypes method and then apply str.strip() to each of the selected columns (pandas has no str.trim(); str.strip() is the method that removes leading and trailing whitespace).
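As a rough illustration of the pandas approach described above, here is a minimal sketch. It assumes the string columns are stored with the object dtype, which may not hold for every pandas version or DataFrame. The function name strip_string_columns and the sample data are hypothetical and are not taken from the linked answer.

```python
import pandas as pd


def strip_string_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with whitespace stripped from object-dtype columns."""
    out = df.copy()
    # Assumption: string columns are stored with the object dtype.
    string_columns = out.select_dtypes(include="object").columns
    for name in string_columns:
        # Series.str.strip removes leading and trailing whitespace;
        # non-string values in an object column come back as NaN.
        out[name] = out[name].str.strip()
    return out


# Hypothetical sample data; dtype=object is set explicitly for the text column.
sample = pd.DataFrame({
    "name": pd.Series(["  alice ", "bob  "], dtype=object),
    "age": [30, 41],
})
cleaned = strip_string_columns(sample)
```

Only the selected columns are reassigned, so the numeric column keeps its dtype. For a Spark DataFrame such as mmDF in the question, the same idea would presumably be a loop over columnList that calls withColumn(name, func.trim(mmDF[name])) once per column name. That variant is not shown or tested here.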

solution1
0 2020-07-29 16:48:02
