
pyspark - Updating a column based on a calculated value from another calculated column

  1. The following code loads data from a csv file into the dataframe df. A SQL table myTable corresponding to this df already exists, and data will be imported from this df into myTable.
  2. myTable has several columns. Column5 and Column6 exist in myTable and are calculated columns, but these columns do not exist in the csv file.
  3. The values of Column5 are calculated from the values of Column1, and the values of Column6 are calculated from the calculated values of Column5. These values are computed by testFunction1 and testFunction2 respectively.
  4. The code works fine for Column5, but throws the following error on the last line of the code below, .withColumn("Column6", newFunction2(df.Column5)).

Question: What may I be doing wrong here, and how can we fix the error? Note: If I remove Column6 from myTable and remove the last line of the code below, the code successfully loads the data into myTable, with Column5 filled (as intended) with values calculated from Column1.

Error:

AttributeError: 'DataFrame' object has no attribute 'Column6'

Code:

from pyspark.sql.types import StringType
from pyspark.sql import functions as F

df = spark.read.csv(".......dfs.core.windows.net/myDataFile.csv", header="true", inferSchema="false")

def testFunction1(Col1Value):
  #do some calculation on column1 value and return it to column5
  return mystr1

def testFunction2(value):
  # do some calculation on column5 value and return it to column6
  return mystr2

newFunction1 = F.udf(testFunction1, StringType())
newFunction2 = F.udf(testFunction2, StringType())

df2 = df.withColumn("Column5", newFunction1(df.Column1)) \
      .withColumn("Column6", newFunction2(df.Column5)) 

The problem is in how you are creating df2. You read the dataframe (df) and create the column "Column5", then reference that column on the second line, but "Column5" does not exist yet in df. If you break the last part into two statements, as in the code below, it should solve the problem:

df2 = df.withColumn("Column5", newFunction1(df.Column1))
df3 = df2.withColumn("Column6", newFunction2(df2.Column5))

