Is there any function which helps me convert date and string format in PySpark?

I am currently working in PySpark and have little experience with it. My data frame looks like:

id       dob            var1
1       13-02-1976     aab@dfsfs
2       01-04-2000     bb@NAm
3       28-11-1979     adam11@kjfd
4       30-01-1955     rehan42@ggg

My expected output looks like:

id       dob            var1             age           var2
1       13-02-1976     aab@dfsfs         43            aab
2       01-04-2000     bb@NAm            19            bb
3       28-11-1979     adam11@kjfd       39            adam11
4       30-01-1955     rehan42@ggg       64            rehan42

What I have done so far:

df = df.select(df.id.cast('int').alias('id'),
               df.dob.cast('date').alias('dob'),
               df.var1.cast('string').alias('var1'))

But I think dob is not converted properly.

df= df.withColumn('age', F.datediff(F.current_date(), df.dob))

As you said, the cast of the dob column is not working properly: cast('date') expects strings in yyyy-MM-dd form, so dd-MM-yyyy values typically come back as null. Please try this:

from pyspark.sql.functions import col, unix_timestamp, to_date
import pyspark.sql.functions as F

df2 = df.withColumn('date_in_dateFormat',
                    to_date(unix_timestamp(F.col('dob'), 'dd-MM-yyyy').cast("timestamp")))
df2.show()
+---+----------+-----------+------------------+
| id|       dob|       var1|date_in_dateFormat|
+---+----------+-----------+------------------+
|  1|13-02-1976|  aab@dfsfs|        1976-02-13|
|  2|01-04-2000|     bb@NAm|        2000-04-01|
|  3|28-11-1979|adam11@kjfd|        1979-11-28|
|  4|30-01-1955|rehan42@ggg|        1955-01-30|
+---+----------+-----------+------------------+

df2.printSchema()
root
 |-- id: integer (nullable = true)
 |-- dob: string (nullable = true)
 |-- var1: string (nullable = true)
 |-- date_in_dateFormat: date (nullable = true)
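
On Spark 2.2 and later, `to_date` also accepts a format string directly, which avoids the round trip through `unix_timestamp`. Below is a minimal sketch, assuming the same `df` with a string `dob` column. The helper names `parse_dob` and `with_parsed_dob` are hypothetical, and the plain-Python function is only there to show what the `dd-MM-yyyy` pattern means:

```python
from datetime import datetime


def parse_dob(text):
    """Plain-Python mirror of the Spark pattern 'dd-MM-yyyy' (day-month-year)."""
    return datetime.strptime(text, "%d-%m-%Y").date()


def with_parsed_dob(df, src="dob", dst="date_in_dateFormat"):
    """Add a DateType column parsed from a 'dd-MM-yyyy' string column."""
    # Imported here so parse_dob can be used without a Spark installation.
    import pyspark.sql.functions as F

    # to_date(column, format) is available from Spark 2.2 onward.
    return df.withColumn(dst, F.to_date(F.col(src), "dd-MM-yyyy"))
```

Calling `with_parsed_dob(df)` should give the same `date_in_dateFormat` column as the `unix_timestamp` version above.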

df3= df2.withColumn('age', F.datediff(F.current_date(), df2.date_in_dateFormat))
df3.show()
+---+----------+-----------+------------------+-----+
| id|       dob|       var1|date_in_dateFormat|  age|
+---+----------+-----------+------------------+-----+
|  1|13-02-1976|  aab@dfsfs|        1976-02-13|15789|
|  2|01-04-2000|     bb@NAm|        2000-04-01| 6975|
|  3|28-11-1979|adam11@kjfd|        1979-11-28|14405|
|  4|30-01-1955|rehan42@ggg|        1955-01-30|23473|
+---+----------+-----------+------------------+-----+
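
Note that `datediff` returns a number of days, while the expected output in the question lists the age in years. One possible way to get whole years is `months_between` divided by 12 and rounded down. This is a sketch under that assumption, not the only approach; the helper names `age_in_years` and `with_age_in_years` are hypothetical, and the plain-Python function is only a reference for the arithmetic:

```python
from datetime import date, datetime


def age_in_years(dob_text, today, fmt="%d-%m-%Y"):
    """Completed years between a 'dd-MM-yyyy' string and the date `today`."""
    dob = datetime.strptime(dob_text, fmt).date()
    # Subtract one year if this year's birthday has not happened yet.
    before_birthday = (today.month, today.day) < (dob.month, dob.day)
    return today.year - dob.year - int(before_birthday)


def with_age_in_years(df, dob_col="date_in_dateFormat"):
    """Spark version: whole years from months_between, without a Python UDF."""
    # Imported here so age_in_years can be used without a Spark installation.
    import pyspark.sql.functions as F

    months = F.months_between(F.current_date(), F.col(dob_col))
    return df.withColumn("age", F.floor(months / 12).cast("int"))
```

For example, `age_in_years("13-02-1976", date(2019, 5, 7))` gives 43. Birthdays on 29 February may be handled slightly differently by the two versions, so treat the plain-Python helper as a sanity check rather than an exact specification of the Spark result.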

# Take everything before the '@' in var1
split_col = F.split(df3['var1'], '@')
df4 = df3.withColumn('Var2', split_col.getItem(0))
df4.show()
+---+----------+-----------+------------------+-----+-------+
| id|       dob|       var1|date_in_dateFormat|  age|   Var2|
+---+----------+-----------+------------------+-----+-------+
|  1|13-02-1976|  aab@dfsfs|        1976-02-13|15789|    aab|
|  2|01-04-2000|     bb@NAm|        2000-04-01| 6975|     bb|
|  3|28-11-1979|adam11@kjfd|        1979-11-28|14405| adam11|
|  4|30-01-1955|rehan42@ggg|        1955-01-30|23473|rehan42|
+---+----------+-----------+------------------+-----+-------+
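
To arrive at the column layout shown in the question (lowercase `var2`, no helper date column), the last step could be written with `substring_index` followed by a `select`. This is a minimal sketch, assuming a data frame that already has `id`, `dob`, `var1` and `age` columns; the helper names `local_part` and `with_var2` are hypothetical, and the plain-Python function only illustrates the string logic:

```python
def local_part(text):
    """Everything before the first '@'; the whole string if there is no '@'."""
    return text.split("@", 1)[0]


def with_var2(df):
    """Add var2 (text before the first '@' in var1) and keep the requested columns."""
    # Imported here so local_part can be used without a Spark installation.
    import pyspark.sql.functions as F

    # substring_index(col, '@', 1) returns the part before the first '@'.
    var2 = F.substring_index(F.col("var1"), "@", 1)
    return df.withColumn("var2", var2).select("id", "dob", "var1", "age", "var2")
```

Usage would be `with_var2(df3).show()`, where `df3` is the data frame with the `age` column from the previous step.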
