
Spark: Merging 2 columns of a DataSet into a single column

I have a table with ids in 2 different columns. I have another table which contains objects associated with those ids. I would like to filter Table 2 down to the rows whose id exists in either id1 or id2 of Table 1.

Table 1:

| id1  | id2 |
|  1   |  1  |
|  1   |  1  |
|  1   |  3  |
|  2   |  5  |
|  3   |  1  | 
|  3   |  2  |
|  3   |  3  |

Table 2:

| id  | obj   |
|  1  |  'A'  |
|  2  |  'B'  |
|  3  |  'C'  |
|  4  |  'D'  | 
|  5  |  'E'  |  
|  6  |  'F'  |
|  7  |  'G'  |

What I am thinking is to create a list from table1 containing the unique ids, which would be [1, 2, 3, 5] in the above example.

Then filter the data frame on the basis of that list, which will give the result below (a rough sketch follows the table):

| id  | obj   |
|  1  |  'A'  |
|  2  |  'B'  |
|  3  |  'C'  |
|  5  |  'E'  |  
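
Roughly, the sketch I have in mind (table1 and table2 being the two DataFrames above):

import spark.implicits._
import org.apache.spark.sql.functions.col

// Collect the distinct ids from both columns into a driver-side array
val ids = table1.select($"id1".as("id"))
  .union(table1.select($"id2".as("id")))
  .distinct()
  .as[Int]
  .collect()   // pulls the ids to the driver; this is the step I worry about for large data

// Keep only the table2 rows whose id is in that array
val result = table2.filter(col("id").isin(ids: _*))
result.show()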

I do have concerns regarding the scalability of this solution, though. The list can be large, and in some cases it may even fail to fit in memory. Any recommendations for that case?

Thanks.

Use Spark SQL. Note: joins in Spark come with a whole set of performance considerations, including DataFrame size, key distribution, etc., so please familiarise yourself with them.

Generally though:

table2.as("t2")
  .join(
    table1.as("t1"),
    $"t2.id" === $"t1.id1" || $"t2.id" === $"t1.id2",
    "left_semi"  // semi join: keep only the table2 rows that have a match, no duplicates, only t2 columns
  )
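
If table1 is much smaller than table2, you could additionally hint a broadcast join so the small side is shipped to every executor instead of being shuffled (a sketch; whether it pays off depends on your actual data sizes):

import org.apache.spark.sql.functions.broadcast

table2.as("t2")
  .join(
    broadcast(table1.as("t1")),  // hint: send the smaller table to every executor
    $"t2.id" === $"t1.id1" || $"t2.id" === $"t1.id2",
    "left_semi"
  )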

Another approach:

import org.apache.spark.sql.functions.{array, explode}

// Collapse both id columns of table1 into one deduplicated "id" column, then join
val id_table = table1.select(explode(array($"*")).as("id")).distinct()
val result = table2.join(id_table, "id")
result.show()

Output:

+---+---+
| id|obj|
+---+---+
|  1|'A'|
|  2|'B'|
|  3|'C'|
|  5|'E'|
+---+---+

The following approach would work:

import org.apache.spark.storage.StorageLevel
import spark.implicits._

// Build the two example tables and cache them to memory and disk
val t1 = Seq((1,1),(1,1),(1,3),(2,5),(3,1),(3,2),(3,3))
val t2 = Seq((1,"A"),(2,"B"),(3,"C"),(4,"D"),(5,"E"),(6,"F"),(7,"G"))
val tt1 = t1.toDF("id1", "id2").persist(StorageLevel.MEMORY_AND_DISK)
val tt2 = t2.toDF("id", "obj").persist(StorageLevel.MEMORY_AND_DISK)

tt1.show()
tt2.show()

tt1.createOrReplaceTempView("table1")
tt2.createOrReplaceTempView("table2")

// Join on either id column and deduplicate the result
val output = spark.sql(
  """
    |SELECT DISTINCT t2.id, t2.obj
    |FROM table1 t1
    |JOIN table2 t2 ON (t1.id1 = t2.id) OR (t1.id2 = t2.id)
    |ORDER BY id
    |""".stripMargin).persist(StorageLevel.MEMORY_AND_DISK)

output.show()

Output:

+---+---+
| id|obj|
+---+---+
|  1|  A|
|  2|  B|
|  3|  C|
|  5|  E|
+---+---+

For memory issues you can persist the data to memory and disk; however, there are more options, and you can choose the one that best fits your particular problem. You can follow this link: RDD Persistence
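
For example, a minimal sketch (here df stands for whichever DataFrame you want to reuse, and MEMORY_AND_DISK_SER is just one of the available levels):

import org.apache.spark.storage.StorageLevel

// Keep serialized partitions in memory and spill the rest to disk
df.persist(StorageLevel.MEMORY_AND_DISK_SER)
df.count()     // an action materializes the cache
// ... reuse df across several queries ...
df.unpersist() // release the cached blocks once you are done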

I would also consider the number of partitions, by configuring:

spark.sql.shuffle.partitions
/*
Configures the number of partitions to use when shuffling data for joins or aggregations.
*/

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("MySparkProcess")
  .master("local[*]")
  .config("spark.sql.shuffle.partitions", "400") // change to a more reasonable default number of partitions for our data
  .config("spark.app.id", "MySparkProcess")      // to silence the Metrics warning
  .getOrCreate()
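
The same setting can also be changed on an existing session at runtime, for example:

spark.conf.set("spark.sql.shuffle.partitions", "400")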

I would also take a look at this link for further configuration:

Performance Tuning

I hope this helps.
