
How to filter data from an RDD and save it to a text file using Scala in Spark

I have two datasets in TSV format. I want to read both TSV files in Spark with Scala and perform some analysis. File 1 has August data and File 2 has September data. How do I read both TSV files using Scala in Spark and save the output to another TSV file? I want to use an intersection operation on the two RDDs.

Below are the formats of the two TSV files.

File 1:

[screenshot: File 1 sample data]

File 2:

[screenshot: File 2 sample data]

The output file should contain the APP_NAME values that were accessed in both months.

Output file data:

[screenshot: expected output data]

val dfTsv1 = spark.read.format("csv")
  .option("delimiter", "\t")
  .option("header", "true") // the files have a header row with column names
  .load("filepath1")
val dfTsv2 = spark.read.format("csv")
  .option("delimiter", "\t")
  .option("header", "true")
  .load("filepath2")

// Join on APP_NAME so only apps accessed in both months remain
val outputDf = dfTsv1.alias("tsv1")
  .join(dfTsv2.alias("tsv2"), dfTsv1("APP_NAME") === dfTsv2("APP_NAME"))
  .drop(dfTsv2("APP_NAME")) // drop the duplicate join column from the second file

outputDf.show()
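The question also asks to save the result to another TSV file, which the snippet above does not cover. A minimal sketch, assuming an output directory path (`outputpath` is a placeholder; Spark creates the directory and writes part files into it, not a single file):

outputDf.write
  .option("delimiter", "\t")
  .option("header", "true")
  .csv("outputpath") // writes a directory of tab-separated part files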

The intersection is nothing but an inner join, so simply perform an inner join on the two DataFrames. Refer to Spark SQL Joins.

val df = df1.join(df2, Seq("APP_NAME"), "inner")
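Since the question explicitly mentions intersecting RDDs, here is a rough equivalent with plain RDDs, intended for the spark-shell where `sc` is the SparkContext. It assumes APP_NAME is the first tab-separated field and that each file starts with a header line; `filepath1`, `filepath2`, and `outputpath` are placeholders, so adjust the field index and paths to your actual data.

import org.apache.spark.rdd.RDD

// Extract the APP_NAME column (assumed to be field 0), skipping the header line
def appNames(rdd: RDD[String]): RDD[String] =
  rdd.filter(!_.startsWith("APP_NAME")).map(_.split("\t")(0))

val rdd1 = sc.textFile("filepath1")
val rdd2 = sc.textFile("filepath2")

// Apps accessed in both months, saved as a directory of text part files
appNames(rdd1).intersection(appNames(rdd2)).saveAsTextFile("outputpath")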
