
Join two text files with one column different in their schema in Spark Scala

I have two text files and I am creating DataFrames out of them. Both files have the same number of columns except for one column.

When I create the schema and join the two, I get an error like:

java.lang.ArrayIndexOutOfBoundsException

Basically, my schema has 6 columns, but one of my text files has only 5 columns.

Now, how do I append some null value to the already created schema and then do the join?

Here is my code

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

val schema = StructType(Array(
  StructField("TimeStamp", StringType),
  StructField("Id", StringType),
  StructField("Name", StringType),
  StructField("Val", StringType),
  StructField("Age", StringType),
  StructField("Dept", StringType)))

val textRdd1 = sc.textFile("s3://test/Text1.txt")
val rowRdd1 = textRdd1.map(line => Row.fromSeq(line.split(",", -1)))
var df1 = sqlContext.createDataFrame(rowRdd1, schema)

val textRdd2 = sc.textFile("s3://test/Text2.txt")
val rowRdd2 = textRdd2.map(line => Row.fromSeq(line.split(",", -1)))
var df2 = sqlContext.createDataFrame(rowRdd2, schema)

val df3 = df1.join(df2)
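
The exception comes from rows that are shorter than the schema: Spark only hits the missing index when it materializes the sixth field at runtime. A minimal illustration (the sample values here are made up):

// A 5-value row read against the 6-field schema above: Spark will
// eventually access index 5, which does not exist in the row.
val shortRow = Row.fromSeq("1,John,10,25,IT".split(",", -1))
shortRow.getString(5) // java.lang.ArrayIndexOutOfBoundsException: 5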

The TimeStamp column is not present in the first text file ...

Why don't you just exclude the TimeStamp field from the schema for the first DataFrame?

val df1 = sqlContext.createDataFrame(rowRdd1, new StructType(schema.tail.toArray))
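
If you would rather not rely on TimeStamp being the first field, a sketch of the same idea that filters it out by name:

// Build the reduced schema by name rather than by position.
val schemaNoTs = StructType(schema.fields.filterNot(_.name == "TimeStamp"))
val df1 = sqlContext.createDataFrame(rowRdd1, schemaNoTs)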

As mentioned in the comments, the schemas need not be identical. You can also specify your join condition and select the columns to join on.
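
For example, a sketch that joins on Id alone (assuming Id is the key you actually want to match on) and keeps only selected columns:

// Explicit join condition plus a narrow select.
val df3 = df1
  .join(df2, df1("Id") === df2("Id"))
  .select(df1("Id"), df1("Name"), df1("Val"), df2("Dept"))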

You can create a new schema without this field and use that schema instead. What Dmitri was suggesting is to use the original schema and remove the field you don't need, to save you writing a second schema definition.

Once you have the two files loaded into DataFrames, perform the join based on the common fields and remove the duplicate columns, which I guess is what you want, like this:

val df3 = df1.join(df2, (df1("Id") === df2("Id")) && (df1("Name") === df2("Name")) && (df1("Val") === df2("Val")) && (df1("Age") === df2("Age")) && (df1("Dept") === df2("Dept")))
  .drop(df2("Id"))
  .drop(df2("Name"))
  .drop(df2("Val"))
  .drop(df2("Age"))
  .drop(df2("Dept"))

Add the TimeStamp column to the first DataFrame:

import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.LongType

// Use StringType instead of LongType if you want to match the schema above.
val df1Final = df1.withColumn("TimeStamp", lit(null).cast(LongType))

Then proceed with the join.
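
For example, a sketch that joins on the five shared columns and keeps df2's populated TimeStamp instead of the all-null placeholder:

// Join on the shared columns, then drop the null placeholder from df1Final.
val df3 = df1Final
  .join(df2, Seq("Id", "Name", "Val", "Age", "Dept"))
  .drop(df1Final("TimeStamp"))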
