
Spark SQL (Java) - Inexpensive way to join X number of files?

I am currently working on a project where I am reading 19 different parquet files and joining on an ID. Some of these files have multiple rows per consumer, some have none.

I have a key file with one column that I join on and another column (userName) that I need; I also need all the columns from the other files.

I create a separate reader for each parquet file; each one reads its file and converts it into a Spark dataset with a structure like this:

GenericStructure1 record;
int id;
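
For reference, one of these readers looks roughly like this (the bean, session, and path names below are placeholders, not my exact code):

// Hypothetical bean for one of the 19 files: the record payload plus the join key.
public class DataSet1Row implements Serializable {
    GenericStructure1 record;
    int id;
}

// Read one parquet file and convert it to a typed dataset.
Dataset<DataSet1Row> dataSet1 = spark.read()
        .parquet("/path/to/file1.parquet")
        .as(Encoders.bean(DataSet1Row.class));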

I then join all of these created datasets like this (imagine all 19):

keyDataset.join(dataSet1, dataSet1.col("id").equalTo(keyDataset.col("id")), "left_outer")
// ... same left_outer join for dataSet2 through dataSet18 ...
.join(dataSet19, dataSet19.col("id").equalTo(keyDataset.col("id")), "left_outer")
.groupBy(keyDataset.col("id"), keyDataset.col("userName"))
.agg(
    collect_set(dataSet1.col("record")).as("set1"),
    // ... same collect_set for dataSet2 through dataSet18 ...
    collect_set(dataSet19.col("record")).as("set19")
)
.select(
    keyDataset.col("id"),
    keyDataset.col("userName"),
    col("set1"),
    col("set19")
)
.as(Encoders.bean(Set.class));

where Set.class looks something like this:

public class Set implements Serializable {
    long id;
    String userName;
    List<GenericStructure1> set1;
    List<GenericStructure19> set19;
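    // getters, setters, and a no-arg constructor go here (Encoders.bean relies on them)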
}

This works fine for 100 records, but when I try to ramp up to one part of a 5MM (five million) record parquet file (roughly 75K records), it churns and burns through memory until it ultimately runs out. In production this needs to run on millions of records, so the fact that it chokes on 75K is a real problem. The only thing is, I don't see a straightforward way to optimize this so it can handle that kind of workload. Does anybody know of an inexpensive way to join a large amount of data like the above?

I was able to get it to work. In the question I mention a keyDataset, which has every key that can appear in any of the other datasets. Instead of trying to join it against all of the other files right out of the gate, I broadcast the keyDataset and join against that after creating a generic dataframe for each dataset.

Dataset<Row> set1RowDataset = set1Dataset
        .groupBy(set1Dataset.col("id"))
        .agg(collect_set(set1Dataset.col("record")).as("set1"))
        .select(
                col("id"),
                col("set1"));

Once I create 19 of those, I then join the generic datasets in their own join like so:

broadcast(set1RowDataset)
        .join(set2RowDataset, "id")
        .join(set3RowDataset, "id")
        .join(set4RowDataset, "id")
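        // ... set5RowDataset through set18RowDataset joined the same way ...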
        .join(set19RowDataset, "id")
        .as(Encoders.bean(Set.class));

Performance-wise, I'm not sure how much of a hit I'm taking by doing the groupBy separately from the join, but my memory stays intact and Spark no longer spills so badly to disk during the shuffle. I was able to run this locally on the one part-file that was failing before, as mentioned above. I haven't tried it on the cluster with the full parquet file yet, but that's my next step.
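
If it helps anyone checking the same thing: one way to confirm the broadcast hint is actually being applied is to print the physical plan and look for a BroadcastHashJoin node (a quick sketch; joined is just a placeholder name for the result of the join chain above):

// A BroadcastHashJoin in the physical plan (rather than a SortMergeJoin)
// indicates the broadcast was honored and the large shuffle was avoided.
joined.explain();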

I used this as my example: Broadcast Example
