How to rename PySpark DataFrame columns by index? (Handling duplicate column names)

Question

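I need to dynamically update the columns of a Spark DataFrame.

Basically, I need to loop through the list of columns, and if a column name already appears earlier in the list, rename that column to its name plus its index.

The code I tried looks like this: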

def dup_cols(df):
    for i, icol in enumerate(df.columns):
        for x, xcol in enumerate(df.columns):
            if icol == xcol and i != x:
                # Renames by name, so every column called xcol is affected
                df = df.withColumnRenamed(xcol, xcol + '_' + str(x))
    return df

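But this renames by name (here, xcol), so it does not solve my problem.

Can I change this so that it renames the columns of the DataFrame by index? I have searched for quite a while and found nothing.

I also cannot convert to a Pandas DataFrame, so I need a Spark/PySpark solution that renames specific columns purely by their index.

Thanks!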

Answer 1

You can use pyspark.sql.DataFrame.toDF() to rename the columns:

Returns a new DataFrame with the new specified column names

Here is an example:

data = [
    (1, 2, 3),
    (4, 5, 6),
    (7, 8, 9)
]

df = spark.createDataFrame(data, ["a", "b", "a"])
df.printSchema()
#root
# |-- a: long (nullable = true)
# |-- b: long (nullable = true)
# |-- a: long (nullable = true)

Create the new names according to your indexing logic:

new_names = []
counter = {c: -1 for c in df.columns}
for c in df.columns:
    new_c = c
    counter[c] += 1
    new_c += str(counter[c]) if counter[c] else ""
    new_names.append(new_c)
print(new_names)
#['a', 'b', 'a1']

Now use toDF() to create a new DataFrame with the new column names:

df = df.toDF(*new_names)
df.printSchema()
#root
# |-- a: long (nullable = true)
# |-- b: long (nullable = true)
# |-- a1: long (nullable = true)
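
The naming step above can also be wrapped in a small helper that follows the scheme from the question (name, underscore, column position). This is only a sketch: the function name rename_duplicates_by_index and its sep parameter are made up for illustration, and it does not check whether a generated name such as a_2 collides with a column that already exists.

```python
def rename_duplicates_by_index(columns, sep="_"):
    """Return new column names where each repeated name gets its position appended."""
    seen = set()
    new_names = []
    for idx, name in enumerate(columns):
        if name in seen:
            # Later occurrence of a name: append the zero-based column position
            new_names.append(f"{name}{sep}{idx}")
        else:
            # First occurrence keeps its original name
            seen.add(name)
            new_names.append(name)
    return new_names


print(rename_duplicates_by_index(["a", "b", "a"]))
# ['a', 'b', 'a_2']
```

With a DataFrame in hand, the result would be passed on as df = df.toDF(*rename_duplicates_by_index(df.columns)).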

Answer 2

Assume dt is the current DataFrame

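new_columns = []

# Build one distinct placeholder name for every column after the first
for i in range(1, len(dt.columns)):
    new_columns.append("new_column_name_" + str(i))

# Rename each remaining column (matched by its current name) to its new name
for c, n in zip(dt.columns[1:], new_columns):
    dt = dt.withColumnRenamed(c, n)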
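
Because withColumnRenamed matches columns by name, it cannot single out one of two identically named columns. A position-based variant is to edit the list of names and hand the whole list to toDF(). The sketch below shows only the list-editing part; rename_at and its renames argument are hypothetical names chosen for this example.

```python
def rename_at(columns, renames):
    """Return a copy of `columns` with the names at the given positions replaced.

    `renames` maps a zero-based column position to its new name.
    """
    new_names = list(columns)  # copy so the caller's list is not modified
    for idx, new_name in renames.items():
        new_names[idx] = new_name
    return new_names


print(rename_at(["a", "b", "a"], {2: "a_dup"}))
# ['a', 'b', 'a_dup']
```

Applied to the DataFrame, this would look like dt = dt.toDF(*rename_at(dt.columns, {2: "a_dup"})).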

2 answers

Answer 1
Score 7, accepted, 2018-12-13 19:07:28

Answer 2
Score -1, 2020-01-02 19:42:27
