How do I insert a concatenated value from one DataFrame into another DataFrame in PySpark?

Question

I'm creating a time_interval column and adding it to an existing DataFrame in PySpark. Ideally the time_interval will be in "HHmm" format, with the minutes rounded down to the nearest 15-minute mark (815, 830, 845, 900, etc.).
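
The rounding rule described above can be checked in plain Python before moving it into Spark. This is only a sketch of the arithmetic: the function name `time_interval_label` is hypothetical, and zero-padding the minute is an assumption made so that the top of the hour comes out as "900", matching the examples listed.

```python
from datetime import datetime


def time_interval_label(ts: datetime) -> str:
    """Return the hour followed by the minute floored to a 15-minute mark."""
    floored_minute = (ts.minute // 15) * 15  # one of 0, 15, 30, 45
    # Assumption: pad the minute to two digits so 9:00 reads "900", not "90".
    return f"{ts.hour}{floored_minute:02d}"


if __name__ == "__main__":
    samples = (
        datetime(2019, 5, 30, 8, 29),
        datetime(2019, 5, 30, 9, 0),
        datetime(2019, 5, 30, 10, 52),
    )
    for ts in samples:
        print(ts.time(), "->", time_interval_label(ts))
```

Integer division by 15 followed by multiplication by 15 is the same floor-to-bucket step that the SQL expression performs with floor(minute / 15) * 15.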

I have Spark SQL code that does the logic for me, but how do I take the value it produces, which is concatenated into a string column, and insert it into an existing DataFrame?

time_interval = sqlContext.sql("select extract(hour from current_timestamp())||floor(extract(minute from current_timestamp())/15)*15")

time_interval.show()

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|concat(CAST(hour(current_timestamp()) AS STRING), CAST((FLOOR((CAST(minute(current_timestamp()) AS DOUBLE) / CAST(15 AS DOUBLE))) * CAST(15 AS BIGINT)) AS STRING))|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                                                               1045|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+

baseDF = sqlContext.sql("select * from test_table")
newBase = baseDF.withColumn("time_interval", lit(str(time_interval)))

newBase.select("time_interval").show()

+--------------------+
|       time_interval|
+--------------------+
|DataFrame[concat(...|
|DataFrame[concat(...|
|DataFrame[concat(...|
|DataFrame[concat(...|
|DataFrame[concat(...|
|DataFrame[concat(...|
|DataFrame[concat(...|
|DataFrame[concat(...|
|DataFrame[concat(...|
|DataFrame[concat(...|
|DataFrame[concat(...|
|DataFrame[concat(...|
|DataFrame[concat(...|
|DataFrame[concat(...|
|DataFrame[concat(...|
|DataFrame[concat(...|
|DataFrame[concat(...|
|DataFrame[concat(...|
|DataFrame[concat(...|
|DataFrame[concat(...|
+--------------------+
only showing top 20 rows

So the expected result should show the actual string value in the new column I'm creating, rather than this string representation of the DataFrame. Something like below:

newBase.select("time_interval").show(1)
+-------------+
|time_interval|
+-------------+
|         1045|
+-------------+

Answer 1

Because time_interval is a DataFrame rather than a string, you need to collect it and extract the required value from the returned rows.

Try this way:

newBase = baseDF.withColumn("time_interval", lit(str(time_interval.collect()[0][0])))
newBase.show()

(or)

By using select() with expr():

newBase = baseDF.select("*",expr("string(extract(hour from current_timestamp())||floor(extract(minute from current_timestamp())/15)*15) AS time_interval"))

As pault mentioned in the comments, using the selectExpr() function:

newBase = baseDF.selectExpr("*","string(extract(hour from current_timestamp())||floor(extract(minute from current_timestamp())/15)*15) AS time_interval")

Example:

>>> from pyspark.sql.functions import *
>>> from pyspark.sql.types import IntegerType
>>> time_interval = spark.sql("select extract(hour from current_timestamp())||floor(extract(minute from current_timestamp())/15)*15")
>>> baseDF=spark.createDataFrame([1,2,3,4],IntegerType())
>>> newBase = baseDF.withColumn("time_interval", lit(str(time_interval.collect()[0][0])))
>>> newBase.show()
+-----+-------------+
|value|time_interval|
+-----+-------------+
|    1|         1245|
|    2|         1245|
|    3|         1245|
|    4|         1245|
+-----+-------------+
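
One caveat with the expressions above: when the current minute is between 0 and 14, the floored value is 0, so the plain concatenation yields something like "90" instead of "900". A minimal sketch of one way to handle this, assuming zero-padded minutes are wanted, is to wrap the minute part in lpad. The names `TIME_INTERVAL_EXPR` and `with_time_interval` are hypothetical; the SQL relies only on the built-in Spark SQL functions concat, cast, hour, minute, floor, lpad and current_timestamp.

```python
# Hypothetical constant: a Spark SQL expression that pads the floored minute
# to two digits before concatenating it with the hour.
TIME_INTERVAL_EXPR = (
    "concat("
    "cast(hour(current_timestamp()) as string), "
    "lpad(cast(floor(minute(current_timestamp()) / 15) * 15 as string), 2, '0')"
    ") AS time_interval"
)


def with_time_interval(df):
    """Return df with an added time_interval string column (hypothetical helper)."""
    # selectExpr parses each argument as SQL; "*" keeps all existing columns.
    return df.selectExpr("*", TIME_INTERVAL_EXPR)
```

With the DataFrame from the example this would be called as newBase = with_time_interval(baseDF). Unlike the collect() approach, which freezes the value at the moment it is collected, this expression is evaluated when the query runs.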

Answer 1
0 Accepted 2019-05-30 17:42:47
