
How to populate a Spark DataFrame column based on another column's value?

I have a use-case where I need to select certain columns from a DataFrame containing at least 30 columns and millions of rows.

I'm loading this data from a Cassandra table using Scala and Apache Spark.

I selected the required columns using: df.select("col1", "col2", "col3", "col4")
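
For context, here is a minimal sketch of the load-and-select step using the spark-cassandra-connector; the keyspace and table names ("my_keyspace", "flows") and the connection host are hypothetical placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("flow-aggregation")
  .config("spark.cassandra.connection.host", "127.0.0.1") // assumed host
  .getOrCreate()

// Read the Cassandra table and keep only the columns needed downstream.
// "my_keyspace" and "flows" are illustrative names, not from the question.
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "flows"))
  .load()
  .select("src_ip", "src_port", "dst_ip", "dst_port", "received_time")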

Now I have to perform a basic groupBy operation to group the data by src_ip, src_port, dst_ip, and dst_port, and I also want the latest value from the received_time column of the original DataFrame.

I want a DataFrame with distinct src_ip values, their counts, and the latest received_time in a new column called last_seen.

I know how to use .withColumn, and I think .map() could also be used here. Since I'm relatively new to this field, I don't know how to proceed further. I could really use your help to get this task done.

Assuming you have a DataFrame df with src_ip, src_port, dst_ip, dst_port, and received_time, you can try:

import org.apache.spark.sql.functions.{col, count, max}

val mydf = df
  .groupBy(col("src_ip"), col("src_port"), col("dst_ip"), col("dst_port"))
  .agg(
    count("received_time").as("row_count"),
    max(col("received_time")).as("max_received_time")
  )

The above computes, for each group defined by the four key columns, the number of received_time values (row_count) and the maximum received_time for that group.
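
If you instead want the result keyed by src_ip alone, with the latest timestamp named last_seen as described in the question, a minimal variation would be (column names taken from the question; the count("*") aggregate is an assumption about what "count" should mean here):

import org.apache.spark.sql.functions.{count, max}

// Group by src_ip only: one output row per distinct source IP,
// with the number of matching rows and the most recent received_time.
val bySrcIp = df
  .groupBy("src_ip")
  .agg(
    count("*").as("count"),
    max("received_time").as("last_seen")
  )

bySrcIp.show()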

