
Spark / Scala - Faster Way to Load Dataframe in Hive Table

I have a data frame which I intend to write as a Hive partitioned table. The code I use to that end is:

for (i <- 0 until 10) {
  // each iteration pulls a different slice of the source table
  val myDf = hiveContext.sql("select * from srcTable where col = " + i)
  myDf.write.mode("append").format("parquet")
    .partitionBy("period_id").saveAsTable("myTable")
}

myDf will contain a different set of data in every iteration (the query above is an oversimplified version of how I actually get the values in myDf).

The myDf.write takes about 5 minutes to load 120,000 rows of data. Is there any way I could further reduce the time taken to write all this data?

First, why do you iterate instead of just loading/saving all the data at once? Second, I could imagine that with your code you write too many (small) files; you could check that on the filesystem (see the sketch after the code below). Normally I repartition my dataframe by the same column that I use as the partition column of the DataFrameWriter; this way I get only 1 file per partition (as long as it isn't too large, otherwise HDFS will automatically split the file):

val cols = (0 until 10)

// the $ column syntax needs the implicits in scope
import hiveContext.implicits._

hiveContext.table("srcTable")
  .where($"col".isin(cols: _*))   // all ten slices selected in one go
  .repartition($"period_id")      // shuffle so each period_id ends up in one partition
  .write
  .format("parquet")
  .partitionBy("period_id")
  .saveAsTable("myTable")

Other than that, it's always a good idea to look into the Spark UI and check whether the number of tasks is in a reasonable relation to the number of executors/cores.
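
To relate the task count to the data layout without opening the UI, you can also check partition counts programmatically. This is only a rough sketch assuming the same hiveContext as above; the shuffle-partition value is an illustrative number, not a recommendation.

import hiveContext.implicits._

// the post-shuffle partition count (default 200) determines how many shuffle
// partitions -- and hence write tasks -- a repartition-by-column produces
hiveContext.setConf("spark.sql.shuffle.partitions", "10")

val df = hiveContext.table("srcTable")
println(s"partitions before repartition: ${df.rdd.partitions.length}")

val repartitioned = df.repartition($"period_id")
println(s"partitions after repartition: ${repartitioned.rdd.partitions.length}")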
