
Pyspark dataframe replace functions: How to work with special characters in column names?

I need to replace all blank strings in a DataFrame with null. Ideally, the replace function of pyspark.sql.DataFrameNaFunctions would do the trick. The code would be df_test = df_test.replace('', None)

However, when a column name contains a special character (a dot), it fails with the error AnalysisException: Cannot resolve column name "NL.Col1" among (NL.Col1, Col2); did you mean to quote the `NL.Col1` column?

I have tried:

  • Not passing any column list to replace, so that it considers all columns.
  • Passing a list of columns wrapped in backticks (as suggested by the error).
  • Passing a list containing only the columns without special characters, to test the functionality. The same error still occurs.

Having the dot in the column name is crucial for a downstream task, so I cannot remove or substitute it.

Below is sample PySpark code in case you want to test it.

# Create df
df_test = spark.createDataFrame([['a','b'], ['',None], [None, None]], ['NL.Col1', 'Col2']) 
ls_map_col = df_test.columns
print(f"""Original column list: {ls_map_col} """)
ls_wrap_col = [f"`{i}`" for i in ls_map_col]
print(f"""Wrapping in `` looks like this: {ls_wrap_col} """)
# Demonstrate that select works fine with the list when column names are wrapped in ``
df_test = df_test.select(*ls_wrap_col)
df_test.display()

#### ERROR OCCURS HERE when special character like dot is present.
# Try replacing values in column 
df_test = df_test.replace('', None) #, ls_wrap_col
df_test.display()
df_test = df_test.dropna(how='all') # dropna also throws same error as replace
df_test.display()


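Preserving those periods in your column names is a bad idea. But if you're determined to do it, you can rebuild the column with when/otherwise, quoting the name in backticks so that Spark reads the dot literally. (regexp_replace looks like the obvious tool, but it only substitutes text inside a string, so it cannot map an empty string to null.) It will kind of suck, because you have to do it for every column in which you want to replace empty strings.

from pyspark.sql.functions import col, when
df_test.withColumn("NL.Col1", when(col("`NL.Col1`") == "", None).otherwise(col("`NL.Col1`"))).show()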
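To avoid repeating that per-column step by hand, the same idea can be applied to every string column in a single select. The following is only a sketch: quote_name and blank_to_null are hypothetical helper names, not part of PySpark, and it assumes that a backtick inside a column name is escaped by doubling it.

```python
def quote_name(name):
    """Wrap a column name in backticks so a dot is not read as a struct path."""
    # Assumption: a literal backtick inside the name is escaped by doubling it.
    return "`" + name.replace("`", "``") + "`"


def blank_to_null(df):
    """Return df with '' replaced by null in every string column."""
    # Imported here so quote_name stays usable without a Spark installation.
    from pyspark.sql import functions as F

    string_cols = {name for name, dtype in df.dtypes if dtype == "string"}
    exprs = []
    for name in df.columns:
        c = F.col(quote_name(name))
        if name in string_cols:
            # Existing nulls fall through to otherwise() and stay null.
            c = F.when(c == "", F.lit(None)).otherwise(c)
        # alias() takes the name literally, so the dot survives.
        exprs.append(c.alias(name))
    return df.select(*exprs)
```

With the sample data from the question, this would be called as df_test = blank_to_null(df_test). Non-string columns are passed through unchanged.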

A simple solution might be to rename the columns so that they do not include . in the name.

# Replace . with _ in every column name
df_test = df_test.toDF(*[c.replace('.', '_') for c in df_test.columns])
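Since the question says the dots must survive for a downstream task, one possible variation is to rename only temporarily: swap the dots for a placeholder, run replace and dropna, then restore the original names. This is a sketch under assumptions: to_safe_names, clean_keeping_dots and the "__DOT__" placeholder are hypothetical names chosen for illustration, and the placeholder is assumed not to occur in any real column name.

```python
def to_safe_names(columns, placeholder="__DOT__"):
    """Return dot-free stand-ins for the given column names."""
    safe = [name.replace(".", placeholder) for name in columns]
    # Refuse to continue if the substitution merges two distinct names.
    if len(set(safe)) != len(set(columns)):
        raise ValueError("placeholder makes two column names collide")
    return safe


def clean_keeping_dots(df, placeholder="__DOT__"):
    """Replace '' with null and drop all-null rows, preserving dotted names."""
    original = df.columns
    # toDF renames positionally, so no column has to be resolved by name.
    renamed = df.toDF(*to_safe_names(original, placeholder))
    cleaned = renamed.replace("", None).dropna(how="all")
    # Restore the original names, dots included.
    return cleaned.toDF(*original)
```

Usage would be df_test = clean_keeping_dots(df_test). The round trip only changes column names, so the data itself is touched solely by replace and dropna.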
