
Azure Databricks Python: add derived column based on content of existing column

I have a dataframe loaded in Spark that comes from a CSV file. However, I want to add two columns to the dataframe, and the content of these columns depends on the content of one column that is already there.

The column that I already have is called YearWeek and can contain values of the form wkxxxx_yy or xxxx_yy, where xxxx is the year.

I need to add a column named Period and a column named Year. The new column Period can only contain the values "Weekly" or "Monthly". If the column YearWeek starts with "wk", then the column Period should have the value "Weekly", otherwise "Monthly".
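To make the rule concrete, here is a minimal plain-Python sketch of the mapping for a single value. The function name and the sample values are hypothetical, and it assumes Year is simply the four-character xxxx part of YearWeek.

```python
from typing import Tuple


def derive_period_and_year(year_week: str) -> Tuple[str, str]:
    """Return (Period, Year) for one YearWeek value."""
    if year_week.startswith("wk"):
        # "wkxxxx_yy": skip the two-character "wk" prefix, take the next four
        return "Weekly", year_week[2:6]
    # "xxxx_yy": the year is the first four characters
    return "Monthly", year_week[:4]


print(derive_period_and_year("wk2020_05"))  # ('Weekly', '2020')
print(derive_period_and_year("2020_05"))    # ('Monthly', '2020')
```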

I did some searching, of course, and came up with the following piece of code:

> df4 = df3.withcolumn(NewColumn5, when          
>     df3.col("YearWeek").startswith("wk"),yearweek[3:6].otherwise(YearWeek[1:4]))\
>     .withcolumn(NewColumn1, when df3.col("YearWeek").startswith("wk"),"Weekly".otherwise("Monthly"))

However, this results in a syntax error

SyntaxError: invalid syntax
File "<command-2818966973632811>", line 61
df4 = df3.withcolumn(NewColumn5, when 
df3.col("YearWeek").startswith("wk"),yearweek[3:6].otherwise(YearWeek[1:4]))\
                                    ^
SyntaxError: invalid syntax

What am I doing wrong?
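For reference, a few things in that snippet look off for PySpark: `when` is a function and needs parentheses around its arguments, the method is spelled `withColumn`, a dataframe has no `col()` method in Python (that is the Scala API), `.otherwise()` has to be called on the result of `when(...)` rather than on a string, and slicing like `yearweek[3:6]` refers to a Python variable that does not exist. A possible version using the DataFrame API is sketched below. The function name is hypothetical, and it assumes the prefix is lowercase "wk" and that Year is the four-character xxxx part.

```python
def add_period_and_year(df):
    """Return df with the derived Period and Year columns added."""
    # Imported here so the definition loads even where PySpark is not installed;
    # in a Databricks notebook this import would normally sit at the top of the cell.
    from pyspark.sql import functions as F

    is_weekly = F.col("YearWeek").startswith("wk")

    return (
        df.withColumn(
            "Period",
            F.when(is_weekly, "Weekly").otherwise("Monthly"),
        ).withColumn(
            "Year",
            # F.substring is 1-based: position 3 skips the "wk" prefix
            F.when(is_weekly, F.substring("YearWeek", 3, 4))
            .otherwise(F.substring("YearWeek", 1, 4)),
        )
    )
```

With the dataframe from the question this would be called as `df4 = add_period_and_year(df3)`.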

In the meantime I solved it differently. I just read the CSV files and put them in one big dataframe. After that I make a table (a temporary view) from the dataframe:

df4.createOrReplaceTempView(tablename)

Then I use Spark SQL to add the derived columns based on the content of the YearWeek column. Actually very easy for me because I am a SQL guy:

df5 = spark.sql("select Somecolumn1,\
                    Somecolumn2,\
                     Somecolumn3,\
                     Somecolumn4,\
                     YearWeek,\
                     Somecolumn5,\
                     Somecolumn6,\
                     Somecolumn7,\
                     Somecolumn8,\
                     Somecolumn9,\
                     Somecolumn10,\
                     Somecolumn11,\
                     Somecolumn12,\
                     CASE WHEN LEFT(YearWeek,2) = 'WK' THEN 'Weekly' ELSE 'Monthly' END AS Period,\
                     CASE WHEN LEFT(YearWeek,2) = 'WK' THEN substring(YearWeek, 3,4) ELSE substring(YearWeek,1,4) END AS Year from " + tablename)
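
A slightly shorter way to write the same query is sketched below. It rests on two assumptions that are not confirmed above: that every existing column should be kept (so `SELECT *` can replace the explicit list), and that the prefix may appear as either "wk" or "WK". Spark SQL compares strings case-sensitively by default, so the sketch lowercases the prefix before comparing. `build_query` and `my_table` are hypothetical names.

```python
def build_query(tablename: str) -> str:
    """Build a SELECT that keeps every column and adds Period and Year."""
    return f"""
        SELECT *,
               CASE WHEN lower(left(YearWeek, 2)) = 'wk'
                    THEN 'Weekly' ELSE 'Monthly' END AS Period,
               CASE WHEN lower(left(YearWeek, 2)) = 'wk'
                    THEN substring(YearWeek, 3, 4)
                    ELSE substring(YearWeek, 1, 4) END AS Year
        FROM {tablename}
    """


print(build_query("my_table"))
```

In the notebook this would be used as `df5 = spark.sql(build_query(tablename))`.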


 