Spark/Scala - Faster Way to Load a Dataframe into a Hive Table
I have a data frame which I intend to write as a Hive-partitioned table. The code I use to that end is:
for (i <- 0 until 10) {
  val myDf = hiveContext.sql(s"select * from srcTable where col = $i")
  myDf.write.mode("append").format("parquet")
    .partitionBy("period_id").saveAsTable("myTable")
}
myDf will contain a different set of data in every iteration (I have just shown an oversimplified way of how I get the values in myDf).
The myDf.write takes about 5 minutes to load 120,000 rows of data. Is there any way I could further reduce the time taken to write all this data?
First, why do you iterate instead of just loading/saving all the data at once? Second, I suspect that with your code you write too many (small) files; you can check that on the filesystem. Normally I repartition my dataframe by the same column I use as the partition column of the DataFrameWriter. This way I get only one file per partition (as long as it isn't too large, otherwise HDFS will automatically split the file):
val cols = (0 until 10)
hiveContext.table("srcTable")
.where($"col".isin(cols:_*))
.repartition($"period_id")
.write
.format("parquet")
.partitionBy("period_id")
.saveAsTable("myTable")
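The small-files check mentioned above can also be done from code with the Hadoop FileSystem API instead of browsing HDFS by hand. A rough sketch, assuming the default Hive warehouse location (the path below is hypothetical and depends on your metastore configuration):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical warehouse path for myTable; adjust to your metastore setup.
val tablePath = new Path("/user/hive/warehouse/mytable")
val fs = FileSystem.get(hiveContext.sparkContext.hadoopConfiguration)

// Count the parquet files under each period_id=... partition directory.
// Many tiny files per partition is the symptom to look for.
fs.listStatus(tablePath)
  .filter(_.isDirectory)
  .foreach { part =>
    val nFiles = fs.listStatus(part.getPath)
      .count(_.getPath.getName.endsWith(".parquet"))
    println(s"${part.getPath.getName}: $nFiles file(s)")
  }
```

After the repartition-based write you would expect roughly one file per partition directory; many files per directory suggests the writer was fed many small partitions.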
Other than that, it's always a good idea to look into the Spark UI and check whether the number of tasks is in a reasonable relation to the number of executors/cores.
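Besides the Spark UI, the partition counts that drive the task numbers can be inspected programmatically. A minimal sketch, assuming the hiveContext from above; note that repartition triggers a shuffle whose output partition count is governed by spark.sql.shuffle.partitions (default 200), which is often far too high for a small write:

```scala
val df = hiveContext.table("srcTable")

// How many partitions (and hence read tasks) the source table produces.
println(s"input partitions: ${df.rdd.getNumPartitions}")

// Lower the post-shuffle task count so it matches your executor cores
// and the number of output partitions; 10 here is an illustrative value.
hiveContext.setConf("spark.sql.shuffle.partitions", "10")
```

If the task count vastly exceeds the available cores, tasks queue up and per-task overhead dominates; matching the two usually shortens small writes like this one.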