
Spark Dataframe column name change does not reflect

I am trying to rename columns in my Spark dataframe to remove some special characters. For some weird reason, the updated column names show up when I print the schema, but any attempt to access the data results in an error complaining about the old column name. Here is what I am trying:

# Original Schema
upsertDf.columns

# Output: ['col 0', 'col (0)', 'col {0}', 'col =0', 'col, 0', 'col; 0']

for c in upsertDf.columns:
    upsertDf = upsertDf.withColumnRenamed(c, c.replace(" ", "_").replace("(","__").replace(")","__").replace("{","___").replace("}","___").replace(",","____").replace(";","_____").replace("=","_"))
upsertDf.columns

# Works and returns expected result
# Output: ['col_0', 'col___0__', 'col____0___', 'col__0', 'col_____0', 'col______0']

# Print contents of dataframe
# Throws error complaining about the original attribute name "col 0"
upsertDf.show()

AnalysisException: 'Attribute name "col 0" contains invalid character(s) among " ,;{}()\\n\\t=". Please use alias to rename it.;'

I have tried other options to rename the column (using alias etc.) and they all return the same error; a sketch of the alias attempt is below. It's almost as if the show operation is using a cached version of the schema, but I can't figure out how to force it to use the new names.
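For reference, this is roughly what the alias-based attempt looked like (a reconstruction from the description above; the single regex, which maps every invalid character to "_", is a simplification of the original per-character replacements):

import re

# Reconstruction of the alias-based rename: one select that aliases every column.
cleaned = [re.sub(r"[ ,;{}()=\n\t]", "_", c) for c in upsertDf.columns]
upsertDf = upsertDf.select([upsertDf[c].alias(n) for c, n in zip(upsertDf.columns, cleaned)])
upsertDf.show()  # still fails with the same AnalysisException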

Has anyone run into this issue before?

Have a look at this minimal example (using your renaming code, run in a pyspark shell, version 3.3.1):

df = spark.createDataFrame(
    [("test", "test", "test", "test", "test", "test")],
    ['col 0', 'col (0)', 'col {0}', 'col =0', 'col, 0', 'col; 0']
)

df.columns
['col 0', 'col (0)', 'col {0}', 'col =0', 'col, 0', 'col; 0']

for c in df.columns:
    df = df.withColumnRenamed(c, c.replace(" ", "_").replace("(","__").replace(")","__").replace("{","___").replace("}","___").replace(",","____").replace(";","_____").replace("=","_"))

df.columns
['col_0', 'col___0__', 'col____0___', 'col__0', 'col_____0', 'col______0']

df.show()
+-----+---------+-----------+------+---------+----------+
|col_0|col___0__|col____0___|col__0|col_____0|col______0|
+-----+---------+-----------+------+---------+----------+
| test|     test|       test|  test|     test|      test|
+-----+---------+-----------+------+---------+----------+

As you can see, this executes successfully, so your renaming logic itself is fine.

Since you haven't shared all your code (how upsertDf is defined), we can't know exactly what's going on. But judging by your error message, it comes from ParquetSchemaConverter.scala in a Spark version earlier than 3.2.0 (this error message changed in 3.2.0, see SPARK-34402).
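If you are unsure which version your session is actually running, the standard spark.version attribute tells you:

# Check the Spark version of the running session.
print(spark.version)  # anything below 3.2.0 still produces the old error message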

Make sure that you read in your data and then immediately rename the columns, without doing any other operation in between.
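For example, a minimal sketch of that order of operations (the Parquet path is a placeholder, and the single-regex rename simplifies your per-character mapping):

import re

# Hypothetical input path; substitute wherever upsertDf is actually read from.
upsertDf = spark.read.parquet("/path/to/input")

# Rename straight away, before any select/filter/join, so no other operation
# forces Spark to resolve the original attribute names.
upsertDf = upsertDf.toDF(*[re.sub(r"[ ,;{}()=\n\t]", "_", c) for c in upsertDf.columns])

upsertDf.show()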
