
Create new schema or column names on pyspark Dataframe

I saw this post and it was somewhat helpful, except that I need to change the headers of a dataframe using a list, because the list is long and changes with every dataset I input, so I can't really write out / hard-code the new column names.

Ex:

df = sqlContext.read.load("./assets/"+filename, 
                          format='com.databricks.spark.csv', 
                          header='false', 
                          inferSchema='false')
devices = df.first()
metrics = df.take(2)[1]
# Adding the two header rows together as one as a way of later searching through and sorting rows
# delimiter is "..." since it doesn't occur anywhere in the data and we don't have to worry about multiple splits
header = [str(devices[i]) +"..."+ str(metrics[i]) for i in range(len(devices))]

df2 = df.toDF(header)

Then of course I get this error:

IllegalArgumentException: u"requirement failed: The number of columns doesn't match.\nOld column names (278):

The length of header is 278 and the number of columns is the same. So, the real question is: how do I do a non-hard-coded renaming of headers in a dataframe when I have a list of the new names?

I'm suspecting I have to pass the input in some form other than an actual list object, but how do I do that without iterating through each column (with a selectExpr or alias), creating several new dfs (immutable) with one newly updated column at a time? (yuck)

You can iterate through the old column names and give them your new column names as aliases. A good way to do this is to use the zip function in Python.

First let's create our column name lists:

old_cols = df.columns
new_cols = [str(d) + "..." + str(m) for d, m in zip(devices, metrics)]

(Although I'm assuming "..." refers to another Python object here, because "..." wouldn't be a good character sequence in a column name.)

Finally:

df2 = df.select([df[oc].alias(nc) for oc, nc in zip(old_cols, new_cols)])
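The name-building step itself is plain Python and can be checked without Spark. A minimal sketch, with made-up device and metric values standing in for `df.first()` and `df.take(2)[1]`:

```python
# Hypothetical header rows (the real ones come from the first two rows of the CSV)
devices = ["sensorA", "sensorB", "sensorC"]
metrics = ["temp", "humidity", "pressure"]

# Pair each device with its metric and join them with the "..." delimiter
new_cols = [str(d) + "..." + str(m) for d, m in zip(devices, metrics)]
print(new_cols)  # ['sensorA...temp', 'sensorB...humidity', 'sensorC...pressure']
```

The same list then feeds the alias-based select above, one alias per old column.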

I tried a different approach. Since I wanted to simulate a hard-coded list (and not an actual list object), I used an exec() statement with a string built from all the joined headers.

Note: this is limited to 255 columns, so if you want more than that, you'll have to break it up.

for i in range(len(header)):
    # For the first column name, initialize the string header_str
    if i == 0:
        header_str = "'" + str(header[i]) + "',"
    # For the last name, close the string without a trailing comma
    elif i == len(header)-1:
        header_str = header_str + "'" + header[i] + "'"
    # For everything in the middle, append it the same way
    else:
        header_str = header_str + "'" + header[i] + "',"

exec("df2 = df.toDF("+ header_str +")")
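As a quick sanity check (with made-up header names), the string the loop builds is the same one a comma join would produce, which can be verified in plain Python:

```python
# Hypothetical headers standing in for the real 278-element list
header = ["devA...cpu", "devB...mem", "devC...disk"]

# Loop version, as in the answer above
for i in range(len(header)):
    if i == 0:
        header_str = "'" + str(header[i]) + "',"
    elif i == len(header) - 1:
        header_str = header_str + "'" + header[i] + "'"
    else:
        header_str = header_str + "'" + header[i] + "',"

# Equivalent one-liner
joined = ",".join("'" + h + "'" for h in header)
print(header_str == joined)  # True
```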

