
Is there a function to convert date and string formats in PySpark?

I am currently working in PySpark and have little experience with it. My DataFrame looks like this:

id       dob            var1
1       13-02-1976     aab@dfsfs
2       01-04-2000     bb@NAm
3       28-11-1979     adam11@kjfd
4       30-01-1955     rehan42@ggg

My desired output looks like this:

id       dob            var1             age           var2
1       13-02-1976     aab@dfsfs         43            aab
2       01-04-2000     bb@NAm            19            bb
3       28-11-1979     adam11@kjfd       39            adam11
4       30-01-1955     rehan42@ggg       64            rehan42

What I have done so far:

import pyspark.sql.functions as F

df = df.select(df.id.cast('int').alias('id'),
               df.dob.cast('date').alias('dob'),
               df.var1.cast('string').alias('var1'))

However, I don't think dob is being converted properly. I then tried to compute the age:

df = df.withColumn('age', F.datediff(F.current_date(), df.dob))

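As you said, the cast of the dob column is not working properly. cast('date') expects strings in yyyy-MM-dd form, so values such as 13-02-1976 are not parsed correctly (they typically come back as null). Parse the column with an explicit dd-MM-yyyy format instead: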

from pyspark.sql.functions import col, unix_timestamp, to_date
import pyspark.sql.functions as F

df2 = df.withColumn('date_in_dateFormat',
                    to_date(unix_timestamp(F.col('dob'), 'dd-MM-yyyy').cast('timestamp')))
df2.show()
+---+----------+-----------+------------------+
| id|       dob|       var1|date_in_dateFormat|
+---+----------+-----------+------------------+
|  1|13-02-1976|  aab@dfsfs|        1976-02-13|
|  2|01-04-2000|     bb@NAm|        2000-04-01|
|  3|28-11-1979|adam11@kjfd|        1979-11-28|
|  4|30-01-1955|rehan42@ggg|        1955-01-30|
+---+----------+-----------+------------------+

df2.printSchema()
root
 |-- id: integer (nullable = true)
 |-- dob: string (nullable = true)
 |-- var1: string (nullable = true)
 |-- date_in_dateFormat: date (nullable = true)
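
As a rough analogy for why the plain cast fails, the same distinction can be reproduced in plain Python without Spark: a parser that expects year-month-day rejects 13-02-1976, while one given the day-month-year pattern accepts it. This is only an illustration; parse_date is a hypothetical helper, and %d-%m-%Y is the Python counterpart of Spark's dd-MM-yyyy pattern. On Spark 2.2 and later, to_date(F.col('dob'), 'dd-MM-yyyy') should produce the same column without the unix_timestamp round trip.

```python
from datetime import datetime


def parse_date(text, pattern):
    """Parse text with the given strptime pattern; return None on mismatch."""
    try:
        return datetime.strptime(text, pattern).date()
    except ValueError:
        return None


dobs = ["13-02-1976", "01-04-2000", "28-11-1979", "30-01-1955"]

# Year-first pattern: every value fails to parse, mirroring the failed cast.
iso_attempt = [parse_date(d, "%Y-%m-%d") for d in dobs]

# Day-first pattern: every value parses.
parsed = [parse_date(d, "%d-%m-%Y") for d in dobs]

print(iso_attempt)
print([p.isoformat() for p in parsed])
```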

# Note: datediff returns the number of days between the two dates, not years
df3 = df2.withColumn('age', F.datediff(F.current_date(), df2.date_in_dateFormat))
df3.show()
+---+----------+-----------+------------------+-----+
| id|       dob|       var1|date_in_dateFormat|  age|
+---+----------+-----------+------------------+-----+
|  1|13-02-1976|  aab@dfsfs|        1976-02-13|15789|
|  2|01-04-2000|     bb@NAm|        2000-04-01| 6975|
|  3|28-11-1979|adam11@kjfd|        1979-11-28|14405|
|  4|30-01-1955|rehan42@ggg|        1955-01-30|23473|
+---+----------+-----------+------------------+-----+
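
Note that the age values in this table are in days, whereas the desired output in the question is in whole years. In Spark, one common way to get years is F.floor(F.months_between(F.current_date(), F.col('date_in_dateFormat')) / 12), which should match the completed-years rule except possibly around month-end birthdays. The plain-Python sketch below shows that rule: subtract one year if this year's birthday has not happened yet. age_in_years is a hypothetical helper, and the reference date 2019-05-07 is an assumption inferred from the day counts in the table.

```python
from datetime import date


def age_in_years(dob, today):
    """Return the number of completed years between dob and today."""
    had_birthday = (today.month, today.day) >= (dob.month, dob.day)
    return today.year - dob.year - (0 if had_birthday else 1)


today = date(2019, 5, 7)  # assumed reference date, not taken from the post
dobs = [date(1976, 2, 13), date(2000, 4, 1), date(1979, 11, 28), date(1955, 1, 30)]

days = [(today - d).days for d in dobs]         # what datediff reports
ages = [age_in_years(d, today) for d in dobs]   # what the question asks for

print(days)  # [15789, 6975, 14405, 23473]
print(ages)  # [43, 19, 39, 64]
```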

# Split var1 on '@' and keep the part before it
split_col = F.split(df3['var1'], '@')
df4 = df3.withColumn('Var2', split_col.getItem(0))
df4.show()
+---+----------+-----------+------------------+-----+-------+
| id|       dob|       var1|date_in_dateFormat|  age|   Var2|
+---+----------+-----------+------------------+-----+-------+
|  1|13-02-1976|  aab@dfsfs|        1976-02-13|15789|    aab|
|  2|01-04-2000|     bb@NAm|        2000-04-01| 6975|     bb|
|  3|28-11-1979|adam11@kjfd|        1979-11-28|14405| adam11|
|  4|30-01-1955|rehan42@ggg|        1955-01-30|23473|rehan42|
+---+----------+-----------+------------------+-----+-------+
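
For reference, the split step corresponds to the following plain-Python logic, shown here mainly to make the edge case explicit: a value with no '@' is returned unchanged. local_part is a hypothetical helper. In Spark, F.split treats its second argument as a regular expression ('@' has no special meaning there), and F.substring_index(F.col('var1'), '@', 1) is an alternative that should give the same column.

```python
def local_part(value, delimiter="@"):
    """Return the text before the first delimiter, or the whole string if absent."""
    return value.split(delimiter, 1)[0]


var1 = ["aab@dfsfs", "bb@NAm", "adam11@kjfd", "rehan42@ggg"]
var2 = [local_part(v) for v in var1]

print(var2)  # ['aab', 'bb', 'adam11', 'rehan42']
```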
