Spark Dataframe column name change does not reflect

Question

I am trying to rename some special characters from my spark dataframe. For some weird reason, it shows the updated column name when I print the schema, but any attempt to access the data results in an error complaining about the old column name. Here is what I am trying:

# Original Schema
upsertDf.columns

# Output: ['col 0', 'col (0)', 'col {0}', 'col =0', 'col, 0', 'col; 0']

for c in upsertDf.columns:
    upsertDf = upsertDf.withColumnRenamed(c, c.replace(" ", "_").replace("(","__").replace(")","__").replace("{","___").replace("}","___").replace(",","____").replace(";","_____").replace("=","_"))
upsertDf.columns

# Works and returns expected result
# Output: ['col_0', 'col___0__', 'col____0___', 'col__0', 'col_____0', 'col______0']

# Print contents of dataframe
# Throws error for original attribute name "
upsertDf.show()

AnalysisException: 'Attribute name "col 0" contains invalid character(s) among " ,;{}()\\n\\t=". Please use alias to rename it.;'

I have tried other options to rename the column (using alias etc...) and they all return the same error. Its almost as if the show operation is using a cached version of the schema but I can't figure out how to force it to use the new names.

Has anyone run into this issue before?

Answer 1

Have a look at this minimal example (using your renaming code, ran in a pyspark shell version 3.3.1):

df = spark.createDataFrame(
    [("test", "test", "test", "test", "test", "test")],
    ['col 0', 'col (0)', 'col {0}', 'col =0', 'col, 0', 'col; 0']
)

df.columns
['col 0', 'col (0)', 'col {0}', 'col =0', 'col, 0', 'col; 0']

for c in df.columns:
    df = df.withColumnRenamed(c, c.replace(" ", "_").replace("(","__").replace(")","__").replace("{","___").replace("}","___").replace(",","____").replace(";","_____").replace("=","_"))

df.columns
['col_0', 'col___0__', 'col____0___', 'col__0', 'col_____0', 'col______0']

df.show()
+-----+---------+-----------+------+---------+----------+
|col_0|col___0__|col____0___|col__0|col_____0|col______0|
+-----+---------+-----------+------+---------+----------+
| test|     test|       test|  test|     test|      test|
+-----+---------+-----------+------+---------+----------+

As you see, this executes successfully. So your renaming functionality is OK.

Since you haven't shared all your code (how upsertDf is defined), we can't really know exactly what's going on. But looking at your error message, this comes from ParquetSchemaConverter.scala in a Spark version earlier than 3.2.0 (this error message changed in 3.2.0 , see SPARK-34402 ).

Make sure that you read in your data and then immediately rename the columns, without doing any other operation.

Spark Dataframe column name change does not reflect

Question

1 answers

solution1
0 2023-01-26 16:36:26

Spark Dataframe column name change does not reflect

Question

1 answers

solution1 0 2023-01-26 16:36:26

solution1
0 2023-01-26 16:36:26