I'm new to Spark and need some help. I am reading a CSV file that contains many rows in the format [logDate, id].
Example:
2017-01-11 09:00:00, a
2017-01-11 09:30:00, b
2017-01-11 08:00:00, b
After processing, I want the dataframe to have the structure [lastLoginDate, id, firstLoginDate].
The expected result is: (2017-01-11 09:00:00, a, 2017-01-11 09:00:00) (2017-01-11 09:30:00, b, 2017-01-11 08:00:00)
I have a working solution, but I would like to find a faster way. I read the CSV file into a dataframe, then sort it by id and log_date in two ways (ascending and descending). Finally, I join the two sorted dataframes to get the last login date and the first login date.
My schema is:
|-- game_code: string (nullable = true)
|-- last_login_date: string (nullable = true)
|-- register_date: string (nullable = true)
|-- id: string (nullable = true)
|-- sid: string (nullable = true)
|-- os: string (nullable = true)
|-- devive: string (nullable = true)
|-- deviceId: string (nullable = true)
You can use the built-in first and last functions to get the final dataframe you need:
import org.apache.spark.sql.functions._
df.orderBy("logDate").groupBy("id").agg(last("logDate").as("lastLoginDate"), first("logDate").as("firstLoginDate"))
You should get the following result:
+---+---------------------+---------------------+
|id |lastLoginDate |firstLoginDate |
+---+---------------------+---------------------+
| a |2017-01-11 09:00:00.0|2017-01-11 09:00:00.0|
| b |2017-01-11 09:30:00.0|2017-01-11 08:00:00.0|
+---+---------------------+---------------------+
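One caveat with the orderBy/groupBy approach: first and last depend on the order of rows inside each group, and Spark does not guarantee that a preceding orderBy survives the shuffle that groupBy triggers. Below is a minimal sketch of an order-independent alternative that swaps in min and max (not the functions used above). It assumes logDate is a string in yyyy-MM-dd HH:mm:ss format, so string order matches chronological order. The local SparkSession and the app name are assumptions for illustration only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{max, min}

// Local session for illustration only; in spark-shell a `spark` value already exists.
val spark = SparkSession.builder().master("local[*]").appName("login-dates").getOrCreate()
import spark.implicits._

// Sample rows from the question, as (logDate, id).
val df = Seq(
  ("2017-01-11 09:00:00", "a"),
  ("2017-01-11 09:30:00", "b"),
  ("2017-01-11 08:00:00", "b")
).toDF("logDate", "id")

// min/max do not depend on row order, so no orderBy is needed before groupBy.
val result = df.groupBy("id").agg(
  max("logDate").as("lastLoginDate"),
  min("logDate").as("firstLoginDate")
)

result.show(false)
```

Since this needs a single aggregation and no sort or join, it may also be cheaper than sorting twice and joining, though that is worth measuring on the real data.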
I hope the answer is helpful
Updated
You can include the rest of the columns in the aggregation if you want them all:
import org.apache.spark.sql.functions._

// first(...) keeps one value per id for each of the remaining columns
df.orderBy("last_login_date").groupBy("id")
  .agg(first("last_login_date").as("firstLoginDate"),
    last("last_login_date").as("lastLoginDate"),
    first("game_code").as("game_code"),
    first("register_date").as("register_date"),
    first("sid").as("sid"),
    first("os").as("os"),
    first("devive").as("devive"),
    first("deviceId").as("deviceId"))
  .show(false)
Note: you can also try using a Window function.
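A rough sketch of what a Window-based variant could look like. This is one possible reading of the note, not necessarily what was intended. It assumes the [logDate, id] sample data from the question and computes min and max over a window partitioned by id. The local SparkSession, the app name, and the names byId, withDates and perId are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max, min}

// Local session for illustration only; in spark-shell a `spark` value already exists.
val spark = SparkSession.builder().master("local[*]").appName("login-window").getOrCreate()
import spark.implicits._

// Sample rows from the question, as (logDate, id).
val df = Seq(
  ("2017-01-11 09:00:00", "a"),
  ("2017-01-11 09:30:00", "b"),
  ("2017-01-11 08:00:00", "b")
).toDF("logDate", "id")

// One window partition per id; min/max need no ordering inside the window.
val byId = Window.partitionBy("id")

// Every input row is kept and gains the per-id first and last dates.
val withDates = df.
  withColumn("firstLoginDate", min(col("logDate")).over(byId)).
  withColumn("lastLoginDate", max(col("logDate")).over(byId))

// Collapse to one row per id, matching the shape of the grouped result.
val perId = withDates.select("id", "lastLoginDate", "firstLoginDate").distinct()

perId.show(false)
```

Unlike groupBy, the window keeps all original rows and columns, so the other fields in the schema stay available without wrapping each one in an aggregate.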