pyspark - Updating a column based on a calculated value from another calculated column
I am importing data from a CSV file into a dataframe df. A SQL table myTable corresponding to this df already exists, and data will be imported from this df into myTable. Column5 and Column6 are calculated columns, but these columns do not exist in the CSV file. The line .withColumn("Column6", newFunction2(df.Column5)) of the code below throws the following error.
Question: What might I be doing wrong here, and how can we fix the error?
Note: If I remove Column6 from myTable and remove the last line of the code below, the code successfully loads the data into myTable, with Column5 filled (as intended) with calculated values from Column1.
Error:
AttributeError: 'DataFrame' object has no attribute 'Column6'
Code:
from pyspark.sql.types import StringType
from pyspark.sql import functions as F

df = spark.read.csv(".......dfs.core.windows.net/myDataFile.csv", header="true", inferSchema="false")

def testFunction1(Col1Value):
    # do some calculation on column1 value and return it to column5
    return mystr1

def testFunction2(value):
    # do some calculation on column5 value and return it to column6
    return mystr2

newFunction1 = F.udf(testFunction1, StringType())
newFunction2 = F.udf(testFunction2, StringType())

df2 = df.withColumn("Column5", newFunction1(df.Column1)) \
        .withColumn("Column6", newFunction2(df.Column5))
The problem is in the statement that creates df2. The first withColumn call returns a new dataframe that contains "Column5"; it does not modify df. The second line then references df.Column5, and that attribute is looked up on the original df, where "Column5" doesn't exist yet, so Python raises the AttributeError before the second withColumn is even called. If you break up the last part into two statements such as the code below, so that "Column5" is read from the dataframe that actually contains it, it should solve the problem:
df2 = df.withColumn("Column5", newFunction1(df.Column1))
df3 = df2.withColumn("Column6", newFunction2(df2.Column5))