
Trimming white space in only string columns of a DataFrame

I'm trying to trim leading and trailing whitespace in any given DataFrame, but only in string columns (so as not to alter the schema of the DataFrame). Another solution would be to trim all columns and then infer or replace the schema after trimming, but I'm not sure how to do that either. This is what I'm doing now:

from pyspark.sql.functions import col

mmDF.printSchema()
columnList = [item[0] for item in mmDF.dtypes if item[1].startswith('string')]

mmDF = mmDF.withColumn(col, func.ltrim(func.rtrim(mmDF[col] for mmDF_col in columnList)))

mmDF.show()

mmDF.printSchema()

The trimming line causes this error:

TypeError: Invalid argument, not a string or column: <generator object <genexpr> at 0x0000027D5C63E248> of type <class 'generator'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.

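The answer is found here. Essentially, you select the string columns with the select_dtypes method found in pandas and then apply str.strip() to each of the selected columns (pandas has no str.trim()). Note that this works on a pandas DataFrame, not directly on a Spark one.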
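
A rough sketch of that pandas approach, assuming the string columns are stored with the object dtype (the sample frame and the names df and str_cols are made up for illustration):

```python
import pandas as pd

# Hypothetical sample frame; the string column is given object dtype explicitly
# so that select_dtypes(include="object") picks it up regardless of pandas version.
df = pd.DataFrame({
    "name": pd.Series(["  alice ", "bob  "], dtype=object),
    "age": [30, 41],
})

# Names of the columns stored as Python objects (here: the string column).
str_cols = df.select_dtypes(include="object").columns

# Strip leading and trailing whitespace in each selected column; others are untouched.
df[str_cols] = df[str_cols].apply(lambda s: s.str.strip())

print(df)
```

Only the selected columns are reassigned, so the numeric column keeps its dtype. An object column that holds non-string values would come back as NaN from .str.strip(), so this assumes the object columns really do contain strings.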
