Spark/Scala - Faster Way to Load a Dataframe into a Hive Table
I have a data frame which I intend to write as a Hive-partitioned table. The code I use to that end is:
for (i <- 0 until 10) {
  val myDf = hiveContext.sql(s"select * from srcTable where col = $i")
  myDf.write.mode("append").format("parquet")
    .partitionBy("period_id").saveAsTable("myTable")
}
myDf will contain a different set of data in every iteration (I have just shown an oversimplified way of how I get the values in myDf).
The myDf.write takes about 5 minutes to load 120,000 rows of data. Is there any way I could further reduce the time taken to write all this data?
First, why do you iterate instead of just loading/saving all the data at once? Second, I suspect that with your code you write too many (small) files; you can check that on the filesystem. Normally I repartition my dataframe by the same column I use as the partition column of the DataFrameWriter. This way I get only one file per partition (as long as it isn't too large, otherwise HDFS will automatically split the file):
val cols = (0 until 10)
hiveContext.table("srcTable")
.where($"col".isin(cols:_*))
.repartition($"period_id")
.write
.format("parquet")
.partitionBy("period_id")
.saveAsTable("myTable")
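The small-files check mentioned above can also be done from code with the Hadoop FileSystem API instead of browsing HDFS by hand. A rough sketch, assuming the default Hive warehouse location (the path below is hypothetical and depends on your metastore configuration):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical warehouse path for myTable; adjust to your metastore setup.
val tablePath = new Path("/user/hive/warehouse/mytable")
val fs = FileSystem.get(hiveContext.sparkContext.hadoopConfiguration)

// Count the parquet files under each period_id=... partition directory.
// Many tiny files per partition is the symptom to look for.
fs.listStatus(tablePath)
  .filter(_.isDirectory)
  .foreach { part =>
    val nFiles = fs.listStatus(part.getPath)
      .count(_.getPath.getName.endsWith(".parquet"))
    println(s"${part.getPath.getName}: $nFiles file(s)")
  }
```

After the repartition-based write you would expect roughly one file per partition directory; many files per directory suggests the writer was fed many small partitions.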
Other than that, it's always a good idea to look into the Spark UI and check whether the number of tasks is in a reasonable relation to the number of executors/cores.
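Besides the Spark UI, the partition counts that drive the task numbers can be inspected programmatically. A minimal sketch, assuming the hiveContext from above; note that repartition triggers a shuffle whose output partition count is governed by spark.sql.shuffle.partitions (default 200), which is often far too high for a small write:

```scala
val df = hiveContext.table("srcTable")

// How many partitions (and hence read tasks) the source table produces.
println(s"input partitions: ${df.rdd.getNumPartitions}")

// Lower the post-shuffle task count so it matches your executor cores
// and the number of output partitions; 10 here is an illustrative value.
hiveContext.setConf("spark.sql.shuffle.partitions", "10")
```

If the task count vastly exceeds the available cores, tasks queue up and per-task overhead dominates; matching the two usually shortens small writes like this one.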