
Join two text files with one column different in their schema in Spark Scala

I have two text files and I am creating DataFrames out of them. Both files have the same number of columns except for one column.

When I create the schema and join the two, I get an error like:

java.lang.ArrayIndexOutOfBoundsException

Basically, my schema has 6 columns, but one of my text files has only 5 columns.

Now, how do I append some null value to the already created schema and then do the join?

Here is my code

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

val schema = StructType(Array(
  StructField("TimeStamp", StringType),
  StructField("Id", StringType),
  StructField("Name", StringType),
  StructField("Val", StringType),
  StructField("Age", StringType),
  StructField("Dept", StringType)))

val textRdd1 = sc.textFile("s3://test/Text1.txt")
val rowRdd1 = textRdd1.map(line => Row.fromSeq(line.split(",", -1)))
var df1 = sqlContext.createDataFrame(rowRdd1, schema)

val textRdd2 = sc.textFile("s3://test/Text2.txt")
val rowRdd2 = textRdd2.map(line => Row.fromSeq(line.split(",", -1)))
var df2 = sqlContext.createDataFrame(rowRdd2, schema)

val df3 = df1.join(df2)
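
The exception comes from rows that are shorter than the schema: Spark only hits the missing index when it materializes the sixth field at runtime. A minimal illustration (the sample values here are made up):

// A 5-value row read against the 6-field schema above: Spark will
// eventually access index 5, which does not exist in the row.
val shortRow = Row.fromSeq("1,John,10,25,IT".split(",", -1))
shortRow.getString(5) // java.lang.ArrayIndexOutOfBoundsException: 5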

The TimeStamp column is not present in the first text file ...

Why don't you just exclude the TimeStamp field from the schema for the first DataFrame?

val df1 = sqlContext.createDataFrame(rowRdd1, new StructType(schema.tail.toArray))
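
If you would rather not rely on TimeStamp being the first field, a sketch of the same idea that filters it out by name:

// Build the reduced schema by name rather than by position.
val schemaNoTs = StructType(schema.fields.filterNot(_.name == "TimeStamp"))
val df1 = sqlContext.createDataFrame(rowRdd1, schemaNoTs)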

As mentioned in the comments, the schemas need not be identical. You can also specify your join condition and select the columns to join on.
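
For example, a sketch that joins on Id alone (assuming Id is the key you actually want to match on) and keeps only selected columns:

// Explicit join condition plus a narrow select.
val df3 = df1
  .join(df2, df1("Id") === df2("Id"))
  .select(df1("Id"), df1("Name"), df1("Val"), df2("Dept"))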

You can create a new schema without this field and use that schema instead. What Dmitri was suggesting is to use the original schema and remove the field you don't need, to save you writing a second schema definition.

Once you have the two files loaded into DataFrames, perform the join based on the common fields and remove the duplicate columns, which I guess is what you want, like this:

val df3 = df1.join(df2, (df1("Id") === df2("Id")) && (df1("Name") === df2("Name")) && (df1("Val") === df2("Val")) && (df1("Age") === df2("Age")) && (df1("Dept") === df2("Dept")))
  .drop(df2("Id"))
  .drop(df2("Name"))
  .drop(df2("Val"))
  .drop(df2("Age"))
  .drop(df2("Dept"))

Add the TimeStamp column to the first DataFrame:

import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.LongType

// Use StringType instead of LongType if you want to match the schema above.
val df1Final = df1.withColumn("TimeStamp", lit(null).cast(LongType))

Then proceed with the join.
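
For example, a sketch that joins on the five shared columns and keeps df2's populated TimeStamp instead of the all-null placeholder:

// Join on the shared columns, then drop the null placeholder from df1Final.
val df3 = df1Final
  .join(df2, Seq("Id", "Name", "Val", "Age", "Dept"))
  .drop(df1Final("TimeStamp"))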
