I'm creating a time_interval column and adding it to an existing Data-frame in Pyspark . Ideally the time_interval will be in the " HHmm " format with the minutes being rounded down to the nearest 15 minute mark (815, 830, 845, 900, etc).
I have the spark sql code that does the logic for me but how do I take that value that's concatenated as string column and insert that into an existing Data-frame?
time_interval = sqlContext.sql("select extract(hour from current_timestamp())||floor(extract(minute from current_timestamp())/15)*15")
time_interval.show()
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|concat(CAST(hour(current_timestamp()) AS STRING), CAST((FLOOR((CAST(minute(current_timestamp()) AS DOUBLE) / CAST(15 AS DOUBLE))) * CAST(15 AS BIGINT)) AS STRING))|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 1045|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
baseDF = sqlContext.sql("select * from test_table")
newBase = baseDF.withColumn("time_interval", lit(str(time_interval)))
newBase.select("time_interval").show()
+--------------------+
| time_interval|
+--------------------+
|DataFrame[concat(...|
|DataFrame[concat(...|
|DataFrame[concat(...|
|DataFrame[concat(...|
|DataFrame[concat(...|
|DataFrame[concat(...|
|DataFrame[concat(...|
|DataFrame[concat(...|
|DataFrame[concat(...|
|DataFrame[concat(...|
|DataFrame[concat(...|
|DataFrame[concat(...|
|DataFrame[concat(...|
|DataFrame[concat(...|
|DataFrame[concat(...|
|DataFrame[concat(...|
|DataFrame[concat(...|
|DataFrame[concat(...|
|DataFrame[concat(...|
|DataFrame[concat(...|
+--------------------+
only showing top 20 rows
So the actual expected results should be just showing the actual string value in the new column i'm creating rather than this concatenated value from a data-frame. Something like below:
newBase.select("time_interval").show(1)
+-------------+
|time_interval|
+-------------+
| 1045 |
+-------------+
As time_interval
is a dataframe type, for this case need to collect
and extract the required value out from dataframe
.
Try this way:
newBase = baseDF.withColumn("time_interval", lit(str(time_interval.collect()[0][0])))
newBase.show()
(or)
By using select(expr())
function:
newBase = baseDF.select("*",expr("string(extract(hour from current_timestamp())||floor(extract(minute from current_timestamp())/15)*15) AS time_interval"))
As pault mentioned in comments, using selectExpr()
function:
newBase = baseDF.selectExpr("*","string(extract(hour from current_timestamp())||floor(extract(minute from current_timestamp())/15)*15) AS time_interval")
Example:
>>> from pyspark.sql.functions import *
>>> from pyspark.sql.types import IntegerType
>>> time_interval = spark.sql("select extract(hour from current_timestamp())||floor(extract(minute from current_timestamp())/15)*15")
>>> baseDF=spark.createDataFrame([1,2,3,4],IntegerType())
>>> newBase = baseDF.withColumn("time_interval", lit(str(time_interval.collect()[0][0])))
>>> newBase.show()
+-----+-------------+
|value|time_interval|
+-----+-------------+
| 1| 1245|
| 2| 1245|
| 3| 1245|
| 4| 1245|
+-----+-------------+
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.