
Spark - How to handle dataframe

I'm new to Spark and need your help. I'm reading a CSV file that contains many rows in the format [logDate, id].

Example:

2017-01-11 09:00:00, a
2017-01-11 09:30:00, b
2017-01-11 08:00:00, b

I want the resulting DataFrame to have the structure [lastLoginDate, id, firstLoginDate].

The expected result is: (2017-01-11 09:00:00, a, 2017-01-11 09:00:00) (2017-01-11 09:30:00, b, 2017-01-11 08:00:00).

I have one solution, but I'd like to find a faster way. I read the CSV file into a DataFrame, then sort it by id and log_date in two ways (ascending and descending). Finally, I join the two sorted DataFrames to get the last login date and first login date fields.

My schema is:

|-- game_code: string (nullable = true) 
|-- last_login_date: string (nullable = true) 
|-- register_date: string (nullable = true) 
|-- id: string (nullable = true) 
|-- sid: string (nullable = true) 
|-- os: string (nullable = true) 
|-- devive: string (nullable = true) 
|-- deviceId: string (nullable = true)

You can use the built-in first and last functions to get the DataFrame you need:

import org.apache.spark.sql.functions.{first, last}
df.orderBy("logDate").groupBy("id").agg(last("logDate").as("lastLoginDate"), first("logDate").as("firstLoginDate"))

You should get the following result:

+---+---------------------+---------------------+
|id |lastLoginDate        |firstLoginDate       |
+---+---------------------+---------------------+
| a |2017-01-11 09:00:00.0|2017-01-11 09:00:00.0|
| b |2017-01-11 09:30:00.0|2017-01-11 08:00:00.0|
+---+---------------------+---------------------+
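One caveat with the approach above: Spark documents first and last as non-deterministic, because their result depends on row order, which may not survive the shuffle that groupBy triggers. A minimal sketch of an alternative that does not depend on row order uses min and max instead. The object name LoginBounds, the helper loginBounds, and the local SparkSession are illustrative assumptions, not part of the original answer; it also assumes logDate is a timestamp or a string in yyyy-MM-dd HH:mm:ss form, so that ordering matches chronological order.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, max, min, to_timestamp}

object LoginBounds {
  // Hypothetical helper: earliest and latest logDate per id.
  // min/max do not depend on row order, so no orderBy is needed.
  def loginBounds(df: DataFrame): DataFrame =
    df.groupBy("id")
      .agg(
        max("logDate").as("lastLoginDate"),
        min("logDate").as("firstLoginDate"))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("login-bounds")
      .getOrCreate()
    import spark.implicits._

    // Sample rows taken from the question.
    val df = Seq(
      ("2017-01-11 09:00:00", "a"),
      ("2017-01-11 09:30:00", "b"),
      ("2017-01-11 08:00:00", "b")
    ).toDF("logDate", "id")
      // Parse the string column into a timestamp before aggregating.
      .withColumn("logDate", to_timestamp(col("logDate")))

    loginBounds(df).orderBy("id").show(false)
    spark.stop()
  }
}
```

Since this is a single aggregation with no sort and no join, it should also be cheaper than sorting twice and joining, though that is worth measuring on your own data.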

I hope the answer is helpful

Updated

If you want to keep the rest of the columns, you can include them in the aggregation:

import org.apache.spark.sql.functions._
df.orderBy("last_login_date").groupBy("id")
  .agg(first("last_login_date").as("firstLoginDate"),
    last("last_login_date").as("lastLoginDate"),
    first("game_code").as("game_code"),
    first("register_date").as("register_date"),
    first("sid").as("sid"),
    first("os").as("os"),
    first("devive").as("devive"),
    first("deviceId").as("deviceId"))
  .show(false)

Note: you can also try a Window function.
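A rough sketch of what the Window function variant might look like follows. It is one possible reading of the note, not the answerer's own code. The object name LoginWindow, the helper withLoginBounds, the temporary column rn, and the sample rows are made up for illustration. It assumes last_login_date strings use the yyyy-MM-dd HH:mm:ss format, so that string ordering matches chronological order.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max, min, row_number}

object LoginWindow {
  // Hypothetical helper: adds first/last login per id and keeps one row per id.
  def withLoginBounds(df: DataFrame): DataFrame = {
    // Unordered window: the frame is the whole partition for each id.
    val byId = Window.partitionBy("id")
    // Ordered window: used only to pick the earliest row per id.
    val byIdOrdered = byId.orderBy(col("last_login_date"))

    df.withColumn("firstLoginDate", min("last_login_date").over(byId))
      .withColumn("lastLoginDate", max("last_login_date").over(byId))
      .withColumn("rn", row_number().over(byIdOrdered))
      .where(col("rn") === 1) // keep the earliest row, with all its other columns
      .drop("rn")
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("login-window")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows using a subset of the schema's columns.
    val df = Seq(
      ("a", "2017-01-11 09:00:00", "ios"),
      ("b", "2017-01-11 09:30:00", "android"),
      ("b", "2017-01-11 08:00:00", "android")
    ).toDF("id", "last_login_date", "os")

    withLoginBounds(df).orderBy("id").show(false)
    spark.stop()
  }
}
```

Compared with listing every column inside agg, this keeps all columns of the earliest row without naming each one, at the cost of evaluating window functions over every row before filtering.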

