Is there any function which helps me convert date and string formats in PySpark?
I am currently working in PySpark and have little experience with it. My data frame looks like this:
id dob var1
1 13-02-1976 aab@dfsfs
2 01-04-2000 bb@NAm
3 28-11-1979 adam11@kjfd
4 30-01-1955 rehan42@ggg
My expected output looks like this:
id dob var1 age var2
1 13-02-1976 aab@dfsfs 43 aab
2 01-04-2000 bb@NAm 19 bb
3 28-11-1979 adam11@kjfd 39 adam11
4 30-01-1955 rehan42@ggg 64 rehan42
What I have done so far:
df= df.select( df.id.cast('int').alias('id'),
df.dob.cast('date').alias('dob'),
df.var1.cast('string').alias('var1'))
But I think dob is not converted properly.
df= df.withColumn('age', F.datediff(F.current_date(), df.dob))
As you said, the dob column is not cast correctly. The values are in dd-MM-yyyy format, so the pattern has to be supplied when the string is parsed. Please try this:
from pyspark.sql.functions import col, unix_timestamp, to_date
import pyspark.sql.functions as F
df2 = df.withColumn('date_in_dateFormat', to_date(unix_timestamp(F.col('dob'), 'dd-MM-yyyy').cast("timestamp")))
df2.show()
+---+----------+-----------+------------------+
| id| dob| var1|date_in_dateFormat|
+---+----------+-----------+------------------+
| 1|13-02-1976| aab@dfsfs| 1976-02-13|
| 2|01-04-2000| bb@NAm| 2000-04-01|
| 3|28-11-1979|adam11@kjfd| 1979-11-28|
| 4|30-01-1955|rehan42@ggg| 1955-01-30|
+---+----------+-----------+------------------+
df2.printSchema()
root
|-- id: integer (nullable = true)
|-- dob: string (nullable = true)
|-- var1: string (nullable = true)
|-- date_in_dateFormat: date (nullable = true)
# Note: datediff returns the difference in days, not years
df3 = df2.withColumn('age', F.datediff(F.current_date(), df2.date_in_dateFormat))
df3.show()
+---+----------+-----------+------------------+-----+
| id| dob| var1|date_in_dateFormat| age|
+---+----------+-----------+------------------+-----+
| 1|13-02-1976| aab@dfsfs| 1976-02-13|15789|
| 2|01-04-2000| bb@NAm| 2000-04-01| 6975|
| 3|28-11-1979|adam11@kjfd| 1979-11-28|14405|
| 4|30-01-1955|rehan42@ggg| 1955-01-30|23473|
+---+----------+-----------+------------------+-----+
split_col =F.split(df['var1'], '@')
df4=df3.withColumn('Var2', split_col.getItem(0))
df4.show()
+---+----------+-----------+------------------+-----+-------+
| id| dob| var1|date_in_dateFormat| age| Var2|
+---+----------+-----------+------------------+-----+-------+
| 1|13-02-1976| aab@dfsfs| 1976-02-13|15789| aab|
| 2|01-04-2000| bb@NAm| 2000-04-01| 6975| bb|
| 3|28-11-1979|adam11@kjfd| 1979-11-28|14405| adam11|
| 4|30-01-1955|rehan42@ggg| 1955-01-30|23473|rehan42|
+---+----------+-----------+------------------+-----+-------+