Spark DataFrame column name change not reflected

Question

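I am trying to rename columns in my Spark DataFrame whose names contain special characters. For some odd reason, the updated column names show up when I print the schema, but any attempt to access the data results in an error that complains about the old column name. Here is what I am trying: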

# Original Schema
upsertDf.columns

# Output: ['col 0', 'col (0)', 'col {0}', 'col =0', 'col, 0', 'col; 0']

for c in upsertDf.columns:
    upsertDf = upsertDf.withColumnRenamed(c, c.replace(" ", "_").replace("(","__").replace(")","__").replace("{","___").replace("}","___").replace(",","____").replace(";","_____").replace("=","_"))
upsertDf.columns

# Works and returns expected result
# Output: ['col_0', 'col___0__', 'col____0___', 'col__0', 'col_____0', 'col______0']

# Print contents of dataframe
# Throws an error that mentions the original attribute name
upsertDf.show()

AnalysisException: 'Attribute name "col 0" contains invalid character(s) among " ,;{}()\\n\\t=". Please use alias to rename it.;'

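I have tried other ways of renaming the columns (using alias, etc.), but they all return the same error. It is almost as if the show operation is using a cached version of the schema, but I can't figure out how to force it to use the new names.

Has anyone run into this problem before?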

Answer 1

Take a look at this minimal example (using your renaming code, run in a pyspark shell, version 3.3.1):

df = spark.createDataFrame(
    [("test", "test", "test", "test", "test", "test")],
    ['col 0', 'col (0)', 'col {0}', 'col =0', 'col, 0', 'col; 0']
)

df.columns
['col 0', 'col (0)', 'col {0}', 'col =0', 'col, 0', 'col; 0']

for c in df.columns:
    df = df.withColumnRenamed(c, c.replace(" ", "_").replace("(","__").replace(")","__").replace("{","___").replace("}","___").replace(",","____").replace(";","_____").replace("=","_"))

df.columns
['col_0', 'col___0__', 'col____0___', 'col__0', 'col_____0', 'col______0']

df.show()
+-----+---------+-----------+------+---------+----------+
|col_0|col___0__|col____0___|col__0|col_____0|col______0|
+-----+---------+-----------+------+---------+----------+
| test|     test|       test|  test|     test|      test|
+-----+---------+-----------+------+---------+----------+

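As you can see, this executes successfully, so your renaming code is fine.

Since you haven't shared all of your code (how upsertDf is defined), we can't really know exactly what is happening. But judging by your error message, it comes from ParquetSchemaConverter.scala in a Spark version earlier than 3.2.0 (this error message changed in 3.2.0, see SPARK-34402).

Make sure you rename the columns immediately after reading in the data, before performing any other operation.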
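One way to do the rename in a single step right after the read is sketched below. This is only an illustration, not a confirmed fix for your case: the helper names sanitize_column_name and rename_all_columns are made up for this example, and the character mapping simply mirrors the replace() chain from the question. It relies on DataFrame.toDF, which returns a new DataFrame with the given column names.

```python
# Hypothetical helpers (names invented for this sketch).
# The mapping reproduces the replace() chain from the question:
# each special character becomes a fixed run of underscores.
_REPLACEMENTS = str.maketrans({
    " ": "_",
    "(": "__",
    ")": "__",
    "{": "___",
    "}": "___",
    ",": "____",
    ";": "_____",
    "=": "_",
})


def sanitize_column_name(name: str) -> str:
    """Return the column name with the special characters replaced."""
    return name.translate(_REPLACEMENTS)


def rename_all_columns(df):
    """Rename every column of a PySpark DataFrame in one toDF() call.

    Assumes `df` exposes `.columns` and `.toDF(*names)`, as PySpark
    DataFrames do.
    """
    return df.toDF(*[sanitize_column_name(c) for c in df.columns])


# Pure-Python check of the mapping, no Spark session needed:
print(sanitize_column_name("col (0)"))  # col___0__
```

You would call rename_all_columns(df) on the DataFrame as soon as it has been read, before any other transformation. Note that the error message also lists "\n" and "\t" as invalid characters; the mapping above does not cover them because the original code does not either.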
