
How to filter data from an RDD and save it to a text file using Scala in Spark

I have two datasets in TSV format. I want to read both TSV files in Spark with Scala and perform some analysis. File 1 has August data and File 2 has September data. How do I read both TSV files using Scala in Spark and save the output to another TSV file? I want to use an intersection operation on the two RDDs.

Below are the formats of the two TSV files.

File 1:

[screenshot: File 1 sample data]

File 2:

[screenshot: File 2 sample data]

The output file should contain the APP_NAME values that were accessed in both months.

Output file data:

[screenshot: expected output data]

val dfTsv1 = spark.read.format("csv")
  .option("delimiter", "\t")
  .option("header", "true") // the files have a header row with column names
  .load("filepath1")
val dfTsv2 = spark.read.format("csv")
  .option("delimiter", "\t")
  .option("header", "true")
  .load("filepath2")

// Join on APP_NAME so only apps accessed in both months remain
val outputDf = dfTsv1.alias("tsv1")
  .join(dfTsv2.alias("tsv2"), dfTsv1("APP_NAME") === dfTsv2("APP_NAME"))
  .drop(dfTsv2("APP_NAME")) // drop the duplicate join column from the second file

outputDf.show()
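The question also asks to save the result to another TSV file, which the snippet above does not cover. A minimal sketch, assuming an output directory path (`outputpath` is a placeholder; Spark creates the directory and writes part files into it, not a single file):

outputDf.write
  .option("delimiter", "\t")
  .option("header", "true")
  .csv("outputpath") // writes a directory of tab-separated part files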

The intersection is nothing but an inner join, so simply perform an inner join on the two DataFrames. Refer to Spark SQL Joins.

val df = df1.join(df2, Seq("APP_NAME"), "inner")
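Since the question explicitly mentions intersecting RDDs, here is a rough equivalent with plain RDDs, intended for the spark-shell where `sc` is the SparkContext. It assumes APP_NAME is the first tab-separated field and that each file starts with a header line; `filepath1`, `filepath2`, and `outputpath` are placeholders, so adjust the field index and paths to your actual data.

import org.apache.spark.rdd.RDD

// Extract the APP_NAME column (assumed to be field 0), skipping the header line
def appNames(rdd: RDD[String]): RDD[String] =
  rdd.filter(!_.startsWith("APP_NAME")).map(_.split("\t")(0))

val rdd1 = sc.textFile("filepath1")
val rdd2 = sc.textFile("filepath2")

// Apps accessed in both months, saved as a directory of text part files
appNames(rdd1).intersection(appNames(rdd2)).saveAsTextFile("outputpath")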
