
Merge two tables in Scala/Spark

I have two tab-separated data files like below:

file 1:

number  type    data_present
 1       a        yes
 2       b        no

file 2:

type    group   number  recorded
 d       aa      10       true
 c       cc      20       false

I want to merge these two files so that the output file looks like below:

number  type    data_present    group   recorded
  1      a         yes           NULL    NULL
  2      b         no            NULL    NULL
  10     d         NULL           aa     true
  20     c         NULL           cc     false

As you can see, for columns that are not present in the other file, I'm filling those places with NULL.

Any ideas on how to do this in Scala/Spark?

Create two files for your data set:

$ cat file1.csv 
number  type    data_present
 1       a        yes
 2       b        no

$ cat file2.csv
type    group   number  recorded
 d       aa      10       true
 c       cc      20       false

Convert them to CSV:

$ sed -e 's/^[ \t]*//' file1.csv | tr -s ' \t' ',' > f1.csv
$ sed -e 's/^[ \t]*//' file2.csv | tr -s ' \t' ',' > f2.csv
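Since the question describes the files as tab-separated, the conversion can be checked end to end. Below is a small self-contained sketch (the sample data is recreated with `printf`, so no pre-existing files are assumed): it strips leading whitespace and collapses each run of spaces or tabs into a single comma.

```shell
# Recreate file1 from the question as a genuinely tab-separated file.
printf 'number\ttype\tdata_present\n1\ta\tyes\n2\tb\tno\n' > file1.tsv

# Strip leading whitespace, then squeeze runs of spaces/tabs into one comma.
sed -e 's/^[ \t]*//' file1.tsv | tr -s ' \t' ',' > f1.csv
cat f1.csv
```

With `tr -s ' \t' ','`, both spaces and tabs map to a comma and repeated delimiters are squeezed, so ragged alignment whitespace does not produce empty fields.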

Use the spark-csv module to load the CSV files as DataFrames:

$ spark-shell --packages com.databricks:spark-csv_2.10:1.1.0

import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val df1 = sqlContext.load("com.databricks.spark.csv", Map("path" -> "f1.csv", "header" -> "true"))
val df2 = sqlContext.load("com.databricks.spark.csv", Map("path" -> "f2.csv", "header" -> "true"))

Now perform the join:

scala> df1.join(df2, df1("number") <=> df2("number") && df1("type") <=> df2("type"), "outer").show()

+------+----+------------+----+-----+------+--------+
|number|type|data_present|type|group|number|recorded|
+------+----+------------+----+-----+------+--------+
|     1|   a|         yes|null| null|  null|    null|
|     2|   b|          no|null| null|  null|    null|
|  null|null|        null|   d|   aa|    10|    true|
|  null|null|        null|   c|   cc|    20|   false|
+------+----+------------+----+-----+------+--------+

For more details, see the spark-csv README and the Spark SQL DataFrame API documentation.

This will give you the desired output:

val output = file1.join(file2, Seq("number","type"), "outer")
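On Spark 2.x and later the same idea can be written against `SparkSession`'s built-in CSV reader, with no external spark-csv package. This is a sketch, assuming the converted files `f1.csv` and `f2.csv` from the steps above; joining on `Seq("number", "type")` merges the key columns so each appears only once in the result:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("merge").getOrCreate()

// Built-in CSV reader (Spark 2.x+); header row supplies column names.
val df1 = spark.read.option("header", "true").csv("f1.csv")
val df2 = spark.read.option("header", "true").csv("f2.csv")

// Joining on a Seq of column names keeps a single `number` and `type`
// column; the non-matching side's columns come back as null.
val output = df1.join(df2, Seq("number", "type"), "outer")
output.show()
```

This avoids the duplicate `number` and `type` columns that the `<=>`-based join above produces.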

Simply cast all columns to String, then do a union of the two DFs.
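That union idea can be sketched as follows, assuming `df1` and `df2` are loaded as in the earlier answer. The helper `alignAndUnion` is hypothetical (not from the original answer): it pads each DataFrame with null string columns for whatever the other side has, so both schemas match before the union:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

// Pad each DataFrame with null columns for the fields it lacks,
// then union the two by column name.
def alignAndUnion(a: DataFrame, b: DataFrame): DataFrame = {
  val allCols = (a.columns ++ b.columns).distinct
  def pad(df: DataFrame): DataFrame = df.select(allCols.map { c =>
    if (df.columns.contains(c)) col(c)
    else lit(null).cast("string").as(c)
  }: _*)
  pad(a).unionByName(pad(b))
}

val merged = alignAndUnion(df1, df2)
merged.show()
```

Unlike the outer join, this simply stacks the rows, which matches the desired output since no row in file 1 shares a (number, type) key with a row in file 2.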
