
Selecting Some Columns and a Max Value of a Specific Column From Spark Dataset

Hi, I have a Java Spark Dataset. When I call dataset.show(); it gives the output below.

Col1    col2    rowNum

obj1    item1    1
obj1    item2    2
obj1    item3    3
obj2    item1    4
obj2    item3    5
obj3    item4    6

From the same Dataset, I want to get the output below:

Col1    max(rownum)

obj1    3
obj2    5
obj3    6

I'm totally new to Java Spark. Can anyone help me get the above output from the same Dataset, and also return the last max(rownum), which is 6 in the above case?

The below code would give the required output:

SparkSession s = SparkSession.builder().appName("Stack Overflow Example test").master("local[*]").getOrCreate();
DataFrameReader read = s.read();
// Read the CSV with a header so the columns get their names from the first row
Dataset<Row> resp = read.option("header", "true").csv("D://test.csv");
// Cast rowNum from string to long so max() compares numerically rather than lexicographically
Dataset<Row> withColumn = resp.withColumn("rowNum", resp.col("rowNum").cast("long"));
// Order by Col1, then group by Col1 and take the maximum rowNum per group
Dataset<Row> orderBy = withColumn.orderBy(resp.col("Col1"));
orderBy.groupBy(resp.col("Col1")).max("rowNum").show();

Output:

+----+-----------+
|Col1|max(rowNum)|
+----+-----------+
|obj1|          3|
|obj2|          5|
|obj3|          6|
+----+-----------+

I have used the header information to ensure we get the column names in the schema.

Here the rowNum column needs to be cast to either Integer or Long. After that, the ordering followed by a groupBy can be performed to get the maximum value per group.
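
The question also asks for the single overall maximum (6 in the example). One way is to aggregate over the whole Dataset instead of per group. A minimal sketch, assuming the withColumn Dataset from the code above (the variable names are just the ones used there):

import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.max;

// Aggregate over the entire Dataset (no groupBy) to get the global maximum rowNum
Row maxRow = withColumn.agg(max("rowNum")).first();
long lastRowNum = maxRow.getLong(0); // 6 for the sample data
System.out.println(lastRowNum);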

This is a pretty simple use case, so I can give you some tips. Try using the Dataset Java documentation: https://spark.apache.org/docs/2.2.0/api/java/index.html?org/apache/spark/sql/Dataset.html

You want to use the groupBy function to group rows by Col1. You'll be returned a RelationalGroupedDataset: https://spark.apache.org/docs/2.2.0/api/java/index.html?org/apache/spark/sql/RelationalGroupedDataset.html

You can use the max function to aggregate based on whatever columns you choose. Let me know if you have trouble with this.
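
For example, a minimal sketch of that approach (assuming ds is your original Dataset with rowNum already cast to a numeric type, as in the other answer):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// groupBy returns a RelationalGroupedDataset; max("rowNum") aggregates per Col1 group
Dataset<Row> maxPerCol1 = ds.groupBy(ds.col("Col1")).max("rowNum");
maxPerCol1.show();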

In Java:

String input = "C://Users//U6048715//Desktop//test.csv";
// create a new Spark session
SparkSession spark = SparkSession.builder().master("local[2]").appName("pivot Table").getOrCreate();
// load the CSV file, using the header row for column names
Dataset<Row> file = spark.read().format("csv").option("header", "true").load(input);

// register a temporary view so the data can be queried with SQL
file.createTempView("tempTable");

// cast rowNum to long so max() is numeric, then take the maximum per Col1
Dataset<Row> result = spark.sql("select Col1, max(cast(rowNum as long)) from tempTable group by Col1");
result.show();
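
If you also need the single overall maximum (6 here), a similar query without the group by should work. This is a sketch along the same lines, not part of the original answer; the cast is there because the CSV columns are read as strings:

// global maximum across all rows, cast so the comparison is numeric
spark.sql("select max(cast(rowNum as long)) as maxRowNum from tempTable").show();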
