
Extract only the date from a timestamp column in a DataFrame - Spark in Java

I have the cloudera-quickstart-vm-5.13.0 environment, in which Hadoop and Spark are already installed. I have put a CSV file into HDFS. Then I wrote this Java code to read the CSV and count how many taxi routes there are for each day (e.g., for the day 10/10/2019 there are 29 taxi routes, for the day 11/10/2019 there are 16 taxi routes, and so on). The CSV file's fields are:

● taxi_id
● pickup_datetime
● passengers
● pick_lon
● pick_lat

My Java code is:

package com.bigdata.taxi;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;


public class Main {

    public static void main(String[] args) {
        // TODO Auto-generated method stub

        SparkConf conf = new SparkConf();
        conf.setAppName("My 1st Spark app");
        conf.setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        SparkSession sparkSession = SparkSession.builder().sparkContext(sc.sc()).getOrCreate();

        //Now read csv , from hdfs source
        //[cloudera@quickstart ~]$ hdfs dfs -put /home/cloudera/Desktop/fares.csv hdfs://quickstart.cloudera:8020//user//cloudera//fares.csv
        // Note: HH is the 24-hour clock hour; lowercase hh (1-12) would fail to
        // parse afternoon timestamps such as 14:30:00
        Dataset<Row> df = sparkSession.read()
                .option("header", true)
                .option("inferSchema", "true")
                .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
                .csv("hdfs://quickstart.cloudera:8020//user//cloudera//fares.csv");
        df.show(); //only showing top 20 rows

        // Sort after the aggregation: an orderBy before groupBy has no effect
        // on the order of the grouped result
        Dataset<Row> df2 = df.groupBy("pickup_datetime").count().orderBy("pickup_datetime");
        df2.show();
    }
}
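As an aside on the timestampFormat option: the pattern letters matter. HH is the 24-hour hour-of-day field, while hh is the 12-hour clock-hour-of-am-pm field, so an afternoon time formats very differently under each. A standalone java.time check (the sample pickup time below is made up for illustration) shows the difference:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class HourPatternDemo {
    public static void main(String[] args) {
        // A made-up afternoon pickup time for illustration
        LocalDateTime t = LocalDateTime.of(2019, 10, 10, 14, 30, 0);

        // HH: hour-of-day (0-23)
        System.out.println(t.format(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")));
        // hh: clock-hour-of-am-pm (1-12); 14:30 renders as 02:30
        System.out.println(t.format(DateTimeFormatter.ofPattern("yyyy-MM-dd hh:mm:ss")));
    }
}
```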

But my issue is that the pickup_datetime field contains not only the date but also hours, minutes, and seconds. So how can I remove the hh:mm:ss part of the pickup_datetime column in the DataFrame through Java?

Thanks!

You can add a new column that only contains the date. date_format is helpful here.

df = df.withColumn("pickup_date", date_format(col("pickup_datetime"), "yyyy-MM-dd"));

In the rest of your code, group on the column pickup_date instead of pickup_datetime (e.g. df.groupBy("pickup_date").count()), which gives you one count per day.

Note: you will need to statically import the Spark SQL functions:

import static org.apache.spark.sql.functions.*;
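For intuition, date_format applies a Java date/time pattern to each value of the column, much as java.time formats a single timestamp. This standalone sketch (the sample timestamp is made up) shows the same transformation on one value:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class DateOnlyDemo {
    public static void main(String[] args) {
        // Parse a timestamp string the way Spark would with
        // timestampFormat "yyyy-MM-dd HH:mm:ss" (made-up sample value)
        DateTimeFormatter full = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
        LocalDateTime pickup = LocalDateTime.parse("2019-10-10 14:35:22", full);

        // Keep only the date part, as date_format(col, "yyyy-MM-dd") does per row
        String pickupDate = pickup.format(DateTimeFormatter.ofPattern("yyyy-MM-dd"));
        System.out.println(pickupDate); // 2019-10-10
    }
}
```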
