
Spark dataframe Join issue

The code snippet below works fine. (Read CSV, read Parquet, and join the two.)

// Reading csv file -- getting three columns: Number of records: 1
df1 = spark.read.format("csv").load(filePath)

df2 = spark.read.parquet(inputFilePath)

// Join with another table: Number of records: 30 million, total columns: 15
df2.join(broadcast(df1), col("df2col1") === col("df1col1"), "right")

It's weird that the code snippet below does not work. (Read HBase, read Parquet, and join; the only difference is reading from HBase.)

// Reading from Hbase (it reads from Hbase properly) -- getting three columns: Number of records: 1
df1 = read from Hbase code
// It reads from Hbase properly and is able to show the one record.
df1.show

df2 = spark.read.parquet(inputFilePath)

// Join with another table: Number of records: 50 million, total columns: 15
df2.join(broadcast(df1), col("df2col1") === col("df1col1"), "right")

Error: Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 56 tasks (1024.4 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)

Then I added spark.driver.maxResultSize=5g, and another error started occurring: a Java heap space error (run at ThreadPoolExecutor.java). If I observe memory usage in the Manager, I see that usage just keeps going up until it reaches ~50 GB, at which point the OOM error occurs. So for whatever reason the amount of RAM being used to perform this operation is ~10x greater than the size of the RDD I'm trying to use.
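(For reference, a minimal sketch of how that setting can be applied, either at submit time or on the SparkSession builder; the app name below is just a placeholder:)

// On spark-submit:
//   spark-submit --conf spark.driver.maxResultSize=5g ...
// Or on the session builder:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hbase-parquet-join")   // placeholder name
  .config("spark.driver.maxResultSize", "5g")
  .getOrCreate()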

If I persist df1 to memory and disk and do a count(), the program works fine. The code snippet is below:

// Reading from Hbase -- getting three columns: Number of records: 1
df1 = read from Hbase code

df1.persist(StorageLevel.MEMORY_AND_DISK)
val cnt = df1.count()

df2 = spark.read.parquet(inputFilePath)

// Join with another table: Number of records: 50 million, total columns: 15
df2.join(broadcast(df1), col("df2col1") === col("df1col1"), "right")

It works with the file even though it has the same data, but not with HBase. I am running this on a 100-worker-node cluster with 125 GB of memory on each node, so memory is not the problem.

My question is that both the file and HBase have the same data, and both can be read and can show() the data. So why does only HBase fail? I am struggling to understand what might be going wrong with this code. Any suggestions will be appreciated.

When the data is extracted, Spark is unaware of the number of rows retrieved from HBase, hence the strategy chosen is a sort-merge join.

Thus it tries to sort and shuffle the data across the executors.
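A quick way to verify which strategy Spark actually picked is to print the physical plan (a sketch, reusing the df1/df2 names from the question; look for SortMergeJoin vs. BroadcastHashJoin in the output):

import org.apache.spark.sql.functions.{broadcast, col}

// Prints the physical plan for the join
df2.join(broadcast(df1), col("df2col1") === col("df1col1"), "right").explain()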

To avoid the problem we can use a broadcast join, so that we do not have to sort and shuffle the data from df2 on the key column, which is what the last statement in your code snippet already does.

However, to bypass this (since it is only one row), we can use a CASE expression to pad the columns instead.

Example:

import org.apache.spark.sql.functions.{col, lit, when}

// Replace the join with a lookup of the single HBase row
df.withColumn(
  "newCol",
  when(col("df2col1") === lit(hbaseKey), lit(hbaseValueCol1))
    .otherwise(lit(null)))

I'm sometimes struggling with this error too. Often it occurs when Spark tries to broadcast a large table during a join (that happens when Spark's optimizer underestimates the size of the table, or the statistics are not correct). As there is no hint to force a sort-merge join (How to hint for sort merge join or shuffled hash join (and skip broadcast hash join)?), the only option is to disable broadcast joins by setting spark.sql.autoBroadcastJoinThreshold=-1.
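For example (a sketch at the session level; the same value can also be passed with --conf on spark-submit):

// Disable automatic broadcast joins for this session
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)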

When I have a problem with memory during a join it usually means one of two things:

  1. You have too few partitions in the dataframes (the partitions are too big).
  2. There are many duplicates in the two dataframes on the key on which you join, and the join explodes your memory.

Ad 1. I think you should look at the number of partitions you have in each table before the join. When Spark reads a file it does not necessarily keep the same number of partitions as the original table (parquet, csv or other). Reading from csv vs. reading from HBase might create a different number of partitions, and that is why you see differences in performance. Partitions that are too large become even larger after the join, and this creates memory problems. Have a look at the Peak Execution Memory per task in the Spark UI. This will give you some idea about your memory usage per task. I found it best to keep it below 1 GB.
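For example, to check the partition counts before the join (a sketch, using the dataframe names from the question):

// Number of partitions Spark will use for each side of the join
println(s"df1 partitions: ${df1.rdd.getNumPartitions}")
println(s"df2 partitions: ${df2.rdd.getNumPartitions}")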

Solution: Repartition your tables before the join.
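A minimal sketch of what that could look like (the partition count of 400 and the join key column are assumptions; tune them for your data):

import org.apache.spark.sql.functions.col

// Repartition the big table by the join key before joining
val df2Repartitioned = df2.repartition(400, col("df2col1"))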

Ad 2. Maybe not the case here, but worth checking.
