
How to populate a Spark DataFrame column based on another column's value?

I have a use-case where I need to select certain columns from a DataFrame containing at least 30 columns and millions of rows.

I'm loading this data from a Cassandra table using Scala and Apache Spark.

I selected the required columns using: df.select("col1", "col2", "col3", "col4")
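
For context, here is a minimal sketch of the load-and-select step using the spark-cassandra-connector; the keyspace and table names ("my_keyspace", "flows") and the connection host are hypothetical placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("flow-aggregation")
  .config("spark.cassandra.connection.host", "127.0.0.1") // assumed host
  .getOrCreate()

// Read the Cassandra table and keep only the columns needed downstream.
// "my_keyspace" and "flows" are illustrative names, not from the question.
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "flows"))
  .load()
  .select("src_ip", "src_port", "dst_ip", "dst_port", "received_time")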

Now I have to perform a basic groupBy operation to group the data by src_ip, src_port, dst_ip, and dst_port, and I also want the latest value from the received_time column of the original DataFrame.

I want a DataFrame with distinct src_ip values, their counts, and the latest received_time in a new column called last_seen.

I know how to use .withColumn, and I think .map() could also be used here. Since I'm relatively new to this field, I don't know how to proceed further. I could really use your help to get this task done.

Assuming you have a DataFrame df with src_ip, src_port, dst_ip, dst_port, and received_time, you can try:

import org.apache.spark.sql.functions.{col, count, max}

val mydf = df
  .groupBy(col("src_ip"), col("src_port"), col("dst_ip"), col("dst_port"))
  .agg(
    count("received_time").as("row_count"),
    max(col("received_time")).as("max_received_time")
  )

The above computes, for each group defined by the four key columns, the number of received_time values (row_count) and the maximum received_time for that group.
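
If you instead want the result keyed by src_ip alone, with the latest timestamp named last_seen as described in the question, a minimal variation would be (column names taken from the question; the count("*") aggregate is an assumption about what "count" should mean here):

import org.apache.spark.sql.functions.{count, max}

// Group by src_ip only: one output row per distinct source IP,
// with the number of matching rows and the most recent received_time.
val bySrcIp = df
  .groupBy("src_ip")
  .agg(
    count("*").as("count"),
    max("received_time").as("last_seen")
  )

bySrcIp.show()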

