Creating a new schema or column names on a PySpark DataFrame

Question

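I came across this post and it helped a lot, except that I need to change the DataFrame's headers using a list. The header is long and changes with every input dataset, so I really can't write out or hard-code the column names for the new DataFrame.

For example: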

df = sqlContext.read.load("./assets/"+filename, 
                          format='com.databricks.spark.csv', 
                          header='false', 
                          inferSchema='false')
devices = df.first()
metrics = df.take(2)[1]
# Adding the two header rows together as one as a way of later searching through and sorting rows
# delimiter is "..." since it doesn't occur anywhere in the data and we don't have to worry about multiple splits
header = [str(devices[i]) +"..."+ str(metrics[i]) for i in range(len(devices))]

df2 = df.toDF(header)

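Then, of course, I get this error:

IllegalArgumentException: u"requirement failed: The number of columns doesn't match.\nOld column names (278):

The length of header is 278, the same as the number of columns. So the real question is: when I have a list of new names, how do I rename the DataFrame's headers without hard-coding them?

I suspect the names must not be passed in as an actual list object. But how do I do that without iterating over every column, i.e. using selectExpr or alias and creating many new (immutable) DataFrames, each with one newly renamed column?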

Answer 1

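You can iterate over the old column names and give each one its new name as an alias. A good way to do this is with Python's zip function.

First, let's create the lists of column names: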

old_cols = df.columns
new_cols = [str(d) + "..." + str(m) for d, m in zip(devices, metrics)]

I am assuming, though, that "..." stands in for some other Python object, because "..." is not a good character sequence to have in a column name.
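As a small illustration of the naming step, the sketch below runs the same zip-based comprehension on two made-up header rows. The sample values and the "__" delimiter are assumptions for illustration, not taken from the question's data. A delimiter without dots may be easier to work with, since Spark reads a dot in a column reference as struct-field access unless the name is wrapped in backticks.

```python
# Hypothetical header rows standing in for df.first() and df.take(2)[1].
devices = ["dev1", "dev1", "dev2"]
metrics = ["cpu", "mem", "cpu"]

# Assumed delimiter: any string that never occurs in the data and has no dots.
DELIM = "__"

# Pair the i-th device with the i-th metric and join them into one name.
new_cols = [str(d) + DELIM + str(m) for d, m in zip(devices, metrics)]

# Duplicate names make later column lookups ambiguous, so check up front.
assert len(new_cols) == len(set(new_cols)), "column names must be unique"

print(new_cols)
```

The uniqueness check is optional, but it catches the case where two device/metric pairs collapse to the same name before the rename is attempted.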

Finally:

df2 = df.select([df[oc].alias(nc) for oc, nc in zip(old_cols, new_cols)])

Answer 2

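I tried a different approach. Since I wanted to mimic a hard-coded list of names (rather than an actual list object), I used an exec() statement with a string built by concatenating all the headers.

Note: this is limited to 255 columns, so if you need more you will have to break it up.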

# Quote each name with repr() so embedded quotes are escaped,
# then join the names with commas to form the argument list
header_str = ", ".join(repr(str(name)) for name in header)

# Runs the generated statement: df2 = df.toDF('name1', 'name2', ...)
exec("df2 = df.toDF(" + header_str + ")")
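The string built for exec() above spells out a call of the form df.toDF('a', 'b', ...), where each name is a separate positional argument. Python's * unpacking produces that same call shape directly from a list, without generating source code. The sketch below shows the difference using a tiny stand-in class rather than Spark itself: FakeFrame and its count check are hypothetical and only imitate the variadic signature toDF(*cols), so the example runs without a Spark installation. On a real DataFrame the equivalent call would presumably be df.toDF(*header).

```python
class FakeFrame:
    """Hypothetical stand-in for a DataFrame; not part of PySpark."""

    def __init__(self, columns):
        self.columns = list(columns)

    def toDF(self, *cols):
        # Variadic signature: every positional argument counts as one name.
        if len(cols) != len(self.columns):
            raise ValueError(
                "column count mismatch: got %d name(s) for %d column(s)"
                % (len(cols), len(self.columns))
            )
        return FakeFrame(cols)


df = FakeFrame(["_c0", "_c1", "_c2"])
header = ["dev1...cpu", "dev1...mem", "dev2...cpu"]

# Passing the list itself supplies ONE argument, so the counts disagree.
try:
    df.toDF(header)
    list_call_failed = False
except ValueError:
    list_call_failed = True

# Unpacking with * supplies one argument per name.
df2 = df.toDF(*header)
print(list_call_failed, df2.columns)
```

This mirrors the situation in the question: a 278-element list passed as a single argument looks like one name for 278 columns, while unpacking supplies 278 names.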

2 solutions

Solution 1
0 2017-08-31 16:31:51

Solution 2
0 2017-09-01 02:21:49