Replacing spaces in all column names of a Spark DataFrame

Question

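I have Spark DataFrames whose column names contain spaces, and those spaces have to be replaced with underscores.

I know a single column can be renamed with withColumnRenamed() in Spark SQL, but to rename "n" columns this function has to be chained "n" times (as far as I know).

To automate this, I tried: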

val old_names = df.columns          // array of the old column names

val new_names = old_names.map { x =>
   if (x.contains(" "))
      x.replaceAll("\\s", "_")
   else x
}                    // array of new column names with whitespace replaced by "_"

Now, how do I replace the header of df with new_names?

Answer 1

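As a best practice, you should prefer expressions and immutability. Use val rather than var wherever possible.

So in this case it is preferable to use the foldLeft operator: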

val newDf = df.columns
              .foldLeft(df)((curr, n) => curr.withColumnRenamed(n, n.replaceAll("\\s", "_")))
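
For PySpark users, the same fold can be sketched with `functools.reduce`. This is a minimal sketch and not part of the original answer: `rename_spaces` is a hypothetical helper name, and the function assumes only that `df` exposes `columns` and `withColumnRenamed`, as a PySpark DataFrame does.

```python
from functools import reduce
import re


def rename_spaces(df):
    """Fold over the column names, renaming one column per step."""
    return reduce(
        # curr is the DataFrame built so far, name is the next original column name
        lambda curr, name: curr.withColumnRenamed(name, re.sub(r"\s", "_", name)),
        df.columns,
        df,  # initial value of the fold: the untouched DataFrame
    )
```

Each step returns a new DataFrame, so no mutable variable is needed, which mirrors the point about val versus var.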

Answer 2

  var newDf = df
  for(col <- df.columns){
    newDf = newDf.withColumnRenamed(col,col.replaceAll("\\s", "_"))
  }

You can wrap this in a method so that the var does not pollute the surrounding code.

Answer 3

In Python, this can be done with the following code:

# Importing sql types
from pyspark.sql.types import StringType, StructType, StructField
from pyspark.sql.functions import col

# Building a simple dataframe:
schema = StructType([
             StructField("id name", StringType(), True),
             StructField("cities venezuela", StringType(), True)
         ])

column1 = ['A', 'A', 'B', 'B', 'C', 'B']
column2 = ['Maracaibo', 'Valencia', 'Caracas', 'Barcelona', 'Barquisimeto', 'Merida']

# Dataframe:
df = sqlContext.createDataFrame(list(zip(column1, column2)), schema=schema)
df.show()

exprs = [col(column).alias(column.replace(' ', '_')) for column in df.columns]
df.select(*exprs).show()

Answer 4

You can do the same thing in Python:

raw_data1 = raw_data
for col in raw_data.columns:
  raw_data1 = raw_data1.withColumnRenamed(col,col.replace(" ", "_"))

Answer 5

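In Scala, here is another way to achieve the same thing:

    import org.apache.spark.sql.types._

    // Rebuild the schema with renamed fields and reapply it to the same rows.
    val df_with_newColumns = spark.createDataFrame(
      df.rdd,
      StructType(df.schema.map(s => StructField(s.name.replaceAll(" ", "_"), s.dataType, s.nullable))))

Hope this helps!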

Answer 6

I would like to add this solution as well:

import re
for each in df.schema.names:
    # Trim the ends, then collapse each run of whitespace into one underscore.
    df = df.withColumnRenamed(each, re.sub(r'\s+', '_', each.strip()))
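
To see what the choice of pattern changes, here is a small pure-Python sketch (no Spark needed) that compares `str.replace(' ', '_')` with a whitespace regex. The sample names and the helper `normalize` are made up for illustration.

```python
import re


def normalize(name):
    """Trim the ends, then turn every run of whitespace into one underscore."""
    return re.sub(r"\s+", "_", name.strip())


samples = ["id name", " cities  venezuela ", "tab\tseparated"]

for s in samples:
    # str.replace only touches the literal space character, one at a time;
    # the regex version also handles tabs and repeated spaces.
    print(repr(s.replace(" ", "_")), "vs", repr(normalize(s)))
```

The regex form is the safer default when headers come from CSV files, where stray tabs or doubled spaces are common.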

Answer 7

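I had been using the answer given by @kanielc to trim leading and trailing whitespace from column headers, and it worked well when the number of columns was small. Then I had to load a CSV file with about 600 columns, and the code took far longer to run than we could accept.

Earlier code: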

val finalSourceTable = intermediateSourceTable.columns
.foldLeft(intermediateSourceTable)((curr, n) => curr.withColumnRenamed(n, n.trim))

Changed code:

val finalSourceTable = intermediateSourceTable
.toDF(intermediateSourceTable.columns map (_.trim()): _*)

The changed code works very well and is also much faster than the earlier code. In addition, we preserve immutability by not using a var.
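
PySpark's `DataFrame.toDF(*cols)` accepts the full list of new names in the same way, so the single-call version can be sketched in Python as below. `trim_column_names` is a hypothetical helper name; the function assumes `df` is a PySpark DataFrame (anything with `columns` and `toDF` would do).

```python
def trim_column_names(df):
    """Rename every column in one call instead of one withColumnRenamed per column."""
    new_names = [name.strip() for name in df.columns]
    # toDF takes the new names positionally, in the same order as df.columns
    return df.toDF(*new_names)
```

This issues one rename for the whole DataFrame rather than one per column, which is presumably why it scales better to hundreds of columns.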

Answer 8

This is the utility we are using.

 import org.apache.spark.sql.DataFrame

 // Lowercase and trim every column name, then replace spaces with underscores.
 def columnsStandardise(df: DataFrame): DataFrame =
    df.toDF(df.columns.map(_.toLowerCase.trim.replaceAll(" ", "_")): _*)
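
One thing a utility like this may want to guard against: lowercasing and replacing spaces can map two different columns to the same name. Below is a hedged sketch in Python; `standardise_names` is a hypothetical helper that is not part of the answer's code, and raising `ValueError` is just one possible policy.

```python
from collections import Counter


def standardise_names(names):
    """Lowercase, trim, and replace spaces with underscores; fail on collisions."""
    new_names = [n.lower().strip().replace(" ", "_") for n in names]
    dupes = sorted(n for n, count in Counter(new_names).items() if count > 1)
    if dupes:
        raise ValueError(f"column names collide after standardising: {dupes}")
    return new_names


# With a PySpark DataFrame the result could be applied as:
#   df = df.toDF(*standardise_names(df.columns))
```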

8 answers

Answer 1: score 21, 2017-07-10 17:10:33
Answer 2: score 18, accepted, 2016-03-15 19:15:55
Answer 3: score 14, 2016-03-15 19:12:07
Answer 4: score 10, 2018-07-13 13:04:37
Answer 5: score 0, 2019-06-21 03:12:12
Answer 6: score 0, 2022-09-15 22:29:34
Answer 7: score 0, 2022-12-13 12:33:07
Answer 8: score -1, 2022-06-06 08:29:03
