I have a use case where I need to select certain columns from a DataFrame containing at least 30 columns and millions of rows. I'm loading this data from a Cassandra table using Scala and Apache Spark.
I selected the required columns using: df.select("col1","col2","col3","col4")
Now I have to perform a basic groupBy operation to group the data by src_ip, src_port, dst_ip and dst_port, and I also want the latest value from the received_time column of the original DataFrame.
I want a DataFrame with the distinct src_ip values, along with their count and the latest received_time in a new column named last_seen.
I know how to use .withColumn, and I think .map() could also be used here. Since I'm relatively new to this field, I really don't know how to proceed further. I could really use your help to get this task done.
Assuming you have a DataFrame df with the columns src_ip, src_port, dst_ip, dst_port and received_time, you can try:
import org.apache.spark.sql.functions._

val mydf = df
  .groupBy(col("src_ip"), col("src_port"), col("dst_ip"), col("dst_port"))
  .agg(
    count("received_time").as("row_count"),
    max(col("received_time")).as("max_received_time")
  )
The above calculates, for each combination of the group-by columns, the number of received_time values (row_count) as well as the maximum timestamp for that group (max_received_time).
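If you only want one row per distinct src_ip, with its count and the latest received_time exposed as last_seen (as described in the question), you can group by src_ip alone and rename the aggregates. Here is a minimal runnable sketch; the column names and the sample data are assumed from the question, and note that max over received_time returns the latest value as long as the column is a timestamp or a sortable string format:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object LastSeenSketch {
  // One row per distinct src_ip, with its row count and the most
  // recent received_time exposed as a new last_seen column.
  def lastSeenPerSrcIp(df: DataFrame): DataFrame =
    df.groupBy(col("src_ip"))
      .agg(
        count("*").as("count"),
        max(col("received_time")).as("last_seen")
      )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("last-seen-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Small in-memory stand-in for the Cassandra-backed DataFrame
    // from the question (hypothetical sample values).
    val df = Seq(
      ("10.0.0.1", 443, "10.0.0.9", 5050, "2021-01-01 10:00:00"),
      ("10.0.0.1", 443, "10.0.0.9", 5051, "2021-01-01 11:00:00"),
      ("10.0.0.2", 80,  "10.0.0.9", 5052, "2021-01-01 09:30:00")
    ).toDF("src_ip", "src_port", "dst_ip", "dst_port", "received_time")

    lastSeenPerSrcIp(df).show(false)
    spark.stop()
  }
}
```

The same renaming works on the four-column groupBy above: just replace .as("max_received_time") with .as("last_seen").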