How to set a PySpark DataFrame header to another row?

Question

I have a dataframe that looks like this:

# +----+------+---------+
# |col1| col2 |  col3   |
# +----+------+---------+
# |  id| name |    val  |
# |  1 |  a01 |    X    |
# |  2 |  a02 |    Y    |
# +----+------+---------+

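I need to create a new dataframe from it, using row[1] as the new column headers and ignoring or dropping the col1, col2, etc. row. The new table should look like this: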

# +----+------+---------+
# | id | name |   val   |
# +----+------+---------+
# |  1 |  a01 |    X    |
# |  2 |  a02 |    Y    |
# +----+------+---------+

The columns can vary, so I can't use their names to set them explicitly in the new dataframe. This is not using a pandas df.

Answer 1

Assuming there is only one row with id in col1, name in col2 and val in col3, you can use the following logic (commented for clarity and explanation)

#select the row with the header name 
header = df.filter((df['col1'] == 'id') & (df['col2'] == 'name') & (df['col3'] == 'val'))

#selecting the rest of the rows except the header row (note: subtract behaves like SQL EXCEPT DISTINCT, so duplicate data rows are collapsed too)
restDF = df.subtract(header)

#converting the header row into Row 
headerColumn = header.first()

#looping columns for renaming 
for column in restDF.columns:
    restDF = restDF.withColumnRenamed(column, headerColumn[column])

restDF.show(truncate=False)

This should give you

+---+----+---+
|id |name|val|
+---+----+---+
|1  |a01 |X  |
|2  |a02 |Y  |
+---+----+---+

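However, the best option is to set the header option to true when reading the dataframe from the source with sqlContext, so that the header row is read as column names in the first place.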
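Since the question says the columns can vary, the same idea can also be written without naming any column. The following is only a minimal sketch, not part of the original answer. It assumes the header is the row returned by df.first(), which holds for the sample data but is not something Spark guarantees in general, because row order is not fixed after shuffles. The helper names header_names and promote_first_row are hypothetical.

```python
from functools import reduce


def header_names(header_values):
    """Turn the values of the header row into column names (stripped strings)."""
    return [str(value).strip() for value in header_values]


def promote_first_row(df):
    """Return a DataFrame whose column names come from the first row of df.

    Assumption (not from the original answer): df.first() is the header row.
    """
    header = df.first()  # a Row; iterating over it yields values in column order

    # True only for rows that equal the header row in every column.
    # eqNullSafe treats null == null as a match instead of yielding null.
    is_header = reduce(
        lambda left, right: left & right,
        [df[name].eqNullSafe(header[name]) for name in df.columns],
    )

    # Keep the data rows, then rename all columns positionally.
    return df.filter(~is_header).toDF(*header_names(header))
```

It can be called as promote_first_row(df).show(truncate=False). Because it uses filter rather than subtract, duplicate data rows are kept; any data row identical to the header row is dropped along with it.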

Answer 2

Have you tried this? header=True

from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()
df = spark.read.csv("TSCAINV_062020.csv",header=True)

If header is not set to True, PySpark names the columns _c0, _c1, _c2 and so on, and the header line is pushed down to become the first row of data.
