I have a PySpark DataFrame df. A SQL table myTable corresponding to this df already exists, and data will be imported from this df into myTable. Column5 and Column6 are calculated columns in myTable; they do not exist in the csv file. The error occurs on the last line of the code below, `.withColumn("Column6", newFunction2(df.Column5))`.
Question: What may I be doing wrong here, and how can I fix the error?
Note: If I remove Column6 from myTable and remove the last line of the code below, the code successfully loads the data into myTable, with Column5 filled (as intended) with values calculated from Column1.
Error :
AttributeError: 'DataFrame' object has no attribute 'Column6'
Code :
from pyspark.sql.types import StringType
from pyspark.sql import functions as F
df = spark.read.csv(".......dfs.core.windows.net/myDataFile.csv", header="true", inferSchema="false")
def testFunction1(Col1Value):
    # do some calculation on column1 value and return it to column5
    return mystr1

def testFunction2(value):
    # do some calculation on column5 value and return it to column6
    return mystr2
newFunction1 = F.udf(testFunction1, StringType())
newFunction2 = F.udf(testFunction2, StringType())
df2 = df.withColumn("Column5", newFunction1(df.Column1)) \
.withColumn("Column6", newFunction2(df.Column5))
The problem is in how you create df2. You start from df and add "Column5", then on the second line you reference df.Column5 — but "Column5" doesn't exist in df. Attribute access like df.Column5 is resolved against the original DataFrame, which never gains that column; only the new DataFrame returned by withColumn has it. If you break the last part into two statements and reference the intermediate DataFrame, as in the code below, it solves the problem:
df2 = df.withColumn("Column5", newFunction1(df.Column1))
df3 = df2.withColumn("Column6", newFunction2(df2.Column5))