
Pyspark dataframe replace functions: How to work with special characters in column names?

I need to replace all blank strings in the dataframe with null. Ideally, the replace function of pyspark.sql.DataFrameNaFunctions would do the trick. The code would be df_test = df_test.replace('', None)

However, when I have a special character (dot) in a column name, it fails with the error AnalysisException: Cannot resolve column name "NL.Col1" among (NL.Col1, Col2); did you mean to quote the `NL.Col1` column?

I have tried:

  • Not passing any list to the replace function so that it considers all columns.
  • Passing a list of columns wrapped in `` (as suggested by the error).
  • Passing a list of columns which do not have the special character, to test out the functionality. The same error still occurs.

Having a dot in the column name is crucial for a downstream task, so I should not remove or substitute it.

Below is sample pyspark code in case you want to test it.

# Create df
df_test = spark.createDataFrame([['a','b'], ['',None], [None, None]], ['NL.Col1', 'Col2']) 
ls_map_col = df_test.columns
print(f"""Original column list: {ls_map_col} """)
ls_wrap_col = [f"`{i}`" for i in ls_map_col]
print(f"""Wrapping in `` looks like this: {ls_wrap_col} """)
# Demonstrate that select works fine with the list when column names are wrapped in ``
df_test = df_test.select(*ls_wrap_col)
df_test.display()

#### ERROR OCCURS HERE when special character like dot is present.
# Try replacing values in column 
df_test = df_test.replace('', None) #, ls_wrap_col
df_test.display()
df_test = df_test.dropna(how='all') # dropna also throws same error as replace
df_test.display()

[Error screenshot]

Preserving those periods in your object names is a bad idea. But if you're determined to do it, you can do the replacement column by column, quoting each name in backticks; note that regexp_replace cannot produce a real null, so a when/otherwise expression does the job instead. It will kind of suck, because you have to do it for every column you want to replace your empty string in.

from pyspark.sql import functions as F
df_test.withColumn("NL.Col1", F.when(F.col("`NL.Col1`") == '', None).otherwise(F.col("`NL.Col1`"))).show()
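
If the same cleanup is needed everywhere, one way to keep this manageable is to loop over df_test.columns and build the backtick-quoted expression for each one. This is only a sketch along the same lines, reusing the df_test dataframe from the question:

from pyspark.sql import functions as F

# Sketch: apply the empty-string -> null replacement to every column.
# Each name is wrapped in backticks so dotted names like NL.Col1 resolve.
for c in df_test.columns:
    quoted = f"`{c}`"
    df_test = df_test.withColumn(c, F.when(F.col(quoted) == '', None).otherwise(F.col(quoted)))
df_test.show()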

A simple solution might be to rename the columns so they do not include . in the name.

## Replace . with '_' in every column name
df_test = df_test.toDF(*[c.replace('.', '_') for c in df_test.columns])
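
Since the dots have to survive for the downstream task, a possible compromise (a sketch, not part of the original answer, starting from the question's df_test) is to rename the columns only temporarily, run the built-in replace and dropna on the safe names, and then restore the original dotted names:

# Sketch: strip the dots, use the built-in replace/dropna, then put the dots back.
original_cols = df_test.columns
safe_cols = [c.replace('.', '_') for c in original_cols]

df_clean = (df_test.toDF(*safe_cols)
            .replace('', None)
            .dropna(how='all')
            .toDF(*original_cols))
df_clean.show()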
