I have two datasets in TSV format. File 1 has August data and File 2 has September data. How do I read both TSV files in Spark with Scala, perform an intersection on the two datasets, and save the output to another TSV file?
Below are the formats of the two TSV files. File 1
File 2
The output file should contain the App_Name values that were accessed in both months.
Output file data:
// The built-in "csv" source reads TSV when you set the delimiter;
// "com.databricks.spark.csv" is only needed on Spark 1.x.
val dfTsv1 = spark.read.format("csv")
  .option("delimiter", "\t")
  .option("header", "true") // assuming the files have a header row
  .load("filepath1")
val dfTsv2 = spark.read.format("csv")
  .option("delimiter", "\t")
  .option("header", "true")
  .load("filepath2")
val duplicateColumns = List("") // put your duplicate column names here
// Join on APP_NAME across the two DataFrames (the original compared
// dfTsv1("ACCESSED_MONTH") to itself, which is always true).
val outputDf = dfTsv1.alias("tsv1").join(dfTsv2.alias("tsv2"), dfTsv1("APP_NAME") === dfTsv2("APP_NAME"))
  .drop(duplicateColumns: _*)
outputDf.show()
// Save the result as a tab-delimited file ("outputPath" is a placeholder)
outputDf.write.option("delimiter", "\t").option("header", "true").csv("outputPath")
The intersection is nothing but an inner join, so simply perform an inner join on the two DataFrames. Refer to Spark SQL Joins.
val df = df1.join(df2, Seq("APP_NAME"), "inner")
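To see why the inner join gives the intersection: it keeps exactly the APP_NAME values present in both months. A minimal plain-Scala sketch of that semantics, using made-up app names (no Spark required):

```scala
object IntersectionDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical app names accessed in August and September
    val augApps = Seq("Maps", "Mail", "Chat")
    val sepApps = Seq("Mail", "Chat", "News")

    // Like an inner join on APP_NAME: keep only names present in both
    val common = augApps.intersect(sepApps)

    println(common) // List(Mail, Chat)
  }
}
```

The Spark inner join does the same thing at scale, except that rows are matched by key rather than whole values, and duplicates on either side multiply in the result, so deduplicate first (e.g. `df1.select("APP_NAME").distinct`) if you only want the names.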