I have two text files and I am creating a DataFrame from each of them. Both files have the same number of columns, except for one column.
When I create the schema and join the two, I get an error like
java.lang.ArrayIndexOutOfBoundsException
Basically my schema has 6 columns and one of my text files has only 5 columns.
Now, how can I add a null value for the missing column so the rows fit the already created schema, and then do the join?
Here is my code:
val schema = StructType(Array(
StructField("TimeStamp", StringType),
StructField("Id", StringType),
StructField("Name", StringType),
StructField("Val", StringType),
StructField("Age", StringType),
StructField("Dept", StringType)))
val textRdd1 = sc.textFile("s3://test/Text1.txt")
val rowRdd1 = textRdd1.map(line => Row.fromSeq(line.split(",", -1)))
var df1 = sqlContext.createDataFrame(rowRdd1, schema)
val textRdd2 = sc.textFile("s3://test/Text2.txt")
val rowRdd2 = textRdd2.map(line => Row.fromSeq(line.split(",", -1)))
var df2 = sqlContext.createDataFrame(rowRdd2, schema)
val df3 = df1.join(df2)
TimeStamp column is not present in the first text file ...
Why don't you just exclude the TimeStamp field from the schema for the first DataFrame?
val df1 = sqlContext.createDataFrame(rowRdd1, new StructType(schema.tail.toArray))
As mentioned in the comments, the two schemas do not need to be identical. You can also specify your join condition and select the columns to join on.
You can create a new schema without this field, and use this schema. What Dmitri was suggesting is to use the original schema and remove the field that you don't need to save you writing a second schema definition.
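Note that schema.tail only works because TimeStamp happens to be the first field. A minimal sketch of removing the field by name instead, assuming the schema from the question; the helper name withoutField is made up for illustration:

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object SchemaWithoutTimeStamp {
  // Schema copied from the question.
  val schema: StructType = StructType(Array(
    StructField("TimeStamp", StringType),
    StructField("Id", StringType),
    StructField("Name", StringType),
    StructField("Val", StringType),
    StructField("Age", StringType),
    StructField("Dept", StringType)))

  // Hypothetical helper: returns a copy of the schema without the named field.
  def withoutField(s: StructType, name: String): StructType =
    StructType(s.fields.filterNot(_.name == name))

  def main(args: Array[String]): Unit = {
    // Prints the five remaining field names.
    println(withoutField(schema, "TimeStamp").fieldNames.mkString(", "))
  }
}
```

The result can then be passed to createDataFrame for the file that has no TimeStamp column, in place of new StructType(schema.tail.toArray).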
Once you have the two files loaded into DataFrames, you perform the join based on the common fields and remove the duplicate columns, which I guess is what you want, like this:
val df3 = df1.join(df2, (df1("Id") === df2("Id")) && (df1("Name") === df2("Name")) && (df1("Val") === df2("Val")) && (df1("Age") === df2("Age")) && (df1("Dept") === df2("Dept")))
  .drop(df2("Id"))
  .drop(df2("Name"))
  .drop(df2("Val"))
  .drop(df2("Age"))
  .drop(df2("Dept"))
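If every join column has the same name on both sides, the join overload that takes a Seq of column names should give the same result without the explicit drop calls, since it keeps one copy of each join column. A rough sketch, assuming Spark 2.x or later with a local SparkSession (the question uses the older sqlContext) and made-up sample rows:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JoinOnCommonColumns {
  // Columns shared by both files, as listed in the question's schema.
  val common: Seq[String] = Seq("Id", "Name", "Val", "Age", "Dept")

  def run(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Made-up sample rows standing in for the two text files.
    val df1 = Seq(("1", "Alice", "42", "30", "HR")).toDF(common: _*)
    val allColumns = "TimeStamp" +: common
    val df2 = Seq(("2017-01-01", "1", "Alice", "42", "30", "HR")).toDF(allColumns: _*)

    // Inner join on the shared column names; each join column appears once.
    df1.join(df2, common)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("join-sketch").getOrCreate()
    run(spark).printSchema()
    spark.stop()
  }
}
```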
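Add the TimeStamp column to the first DataFrame (built with the five-column schema, since the file itself has no TimeStamp field):
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.LongType
// Use StringType instead of LongType to match the TimeStamp type in the question's schema.
val df1Final = df1.withColumn("TimeStamp", lit(null).cast(LongType))
Then proceed with the join.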
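Another option, closer to what the question literally asks, is to pad each split line with a null before calling Row.fromSeq, so the first file's rows match the six-column schema. A minimal sketch in plain Scala, assuming the missing TimeStamp is the leading field; the helper name padLeft and the sample line are made up:

```scala
object PadFields {
  // Hypothetical helper: prepends nulls so that `fields` has `expected` entries.
  // Arrays that are already long enough are returned unchanged.
  def padLeft(fields: Array[String], expected: Int): Array[String] =
    Array.fill[String](math.max(0, expected - fields.length))(null) ++ fields

  def main(args: Array[String]): Unit = {
    val line = "1,Alice,42,30,HR"                 // made-up row with 5 fields
    val padded = padLeft(line.split(",", -1), 6)  // 6 = number of schema fields
    println(padded.length)
    println(padded.mkString("|"))
  }
}
```

In the question's code this would be applied inside the map, for example Row.fromSeq(PadFields.padLeft(line.split(",", -1), 6)), so that df1 can be created with the original schema.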