
Join two text files with one column different in their schema in Spark Scala

I have two text files and I am creating DataFrames out of them. Both files have the same number of columns except for one column.

When I create the schema and join the two, I get an error like:

java.lang.ArrayIndexOutOfBoundsException

Basically, my schema has six columns, while one of my text files has only five columns.

Now, how can I append some null value for the missing column to the already created schema and then do the join?

Here is my code:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(Array(
  StructField("TimeStamp", StringType),
  StructField("Id", StringType),
  StructField("Name", StringType),
  StructField("Val", StringType),
  StructField("Age", StringType),
  StructField("Dept", StringType)))

// Text1.txt has no TimeStamp column, so each line splits into only five
// values while the schema declares six fields -- this mismatch is what
// raises the ArrayIndexOutOfBoundsException at run time.
val textRdd1 = sc.textFile("s3://test/Text1.txt")
val rowRdd1 = textRdd1.map(line => Row.fromSeq(line.split(",", -1)))
var df1 = sqlContext.createDataFrame(rowRdd1, schema)

val textRdd2 = sc.textFile("s3://test/Text2.txt")
val rowRdd2 = textRdd2.map(line => Row.fromSeq(line.split(",", -1)))
var df2 = sqlContext.createDataFrame(rowRdd2, schema)

val df3 = df1.join(df2)

The TimeStamp column is not present in the first text file ...

Why don't you just exclude the TimeStamp field from the schema for the first DataFrame?

// schema.tail drops the first field (TimeStamp), reusing the original definition
val df1 = sqlContext.createDataFrame(rowRdd1, new StructType(schema.tail.toArray))

As mentioned in the comments, the schemas do not need to be identical. You can also specify a join condition and select the columns to join on, as in the sketch below.
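For example, a minimal sketch of that (assuming Id is the key shared by both files; the selected columns are just illustrative):

// Hypothetical example: join on the shared Id column and pick the
// columns to keep explicitly, so mismatched schemas are irrelevant.
val joined = df1.join(df2, df1("Id") === df2("Id"))
  .select(df1("Id"), df1("Name"), df1("Val"), df2("TimeStamp"))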

You can create a new schema without this field and use that schema. What Dmitri was suggesting is to use the original schema and remove the field you don't need, which saves you from writing a second schema definition.
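A small sketch of that idea, removing the field by name instead of by position (so it keeps working if the field order ever changes):

import org.apache.spark.sql.types.StructType

// Build the reduced schema by filtering out the TimeStamp field by name
val schemaWithoutTs = StructType(schema.fields.filterNot(_.name == "TimeStamp"))
val df1 = sqlContext.createDataFrame(rowRdd1, schemaWithoutTs)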

Once you have the two files loaded into DataFrames, you perform the join based on the common fields and remove the duplicate columns, which I guess is what you want, doing this:

val df3 = df1.join(df2, (df1("Id") === df2("Id")) && (df1("Name") === df2("Name")) && (df1("Val") === df2("Val")) && (df1("Age") === df2("Age")) && (df1("Dept") === df2("Dept")))
  .drop(df2("Id"))
  .drop(df2("Name"))
  .drop(df2("Val"))
  .drop(df2("Age"))
  .drop(df2("Dept"))

Add the TimeStamp column to the first DataFrame:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType

// Pad df1 with a null TimeStamp, cast to StringType to match the schema above
val df1Final = df1.withColumn("TimeStamp", lit(null).cast(StringType))

Then proceed with the join.
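A sketch of that last step (assuming the five shared columns are the join keys, which is my reading of the question): join on a Seq of names so each key appears only once, then drop the padded null TimeStamp so only the real one from df2 survives.

// Illustrative only: join on the five common columns and keep df2's TimeStamp
val df3 = df1Final.join(df2, Seq("Id", "Name", "Val", "Age", "Dept"))
  .drop(df1Final("TimeStamp"))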
