
Spark performance when saving a large dataset from a DataFrame to HDFS or Hive

Original post: 2019-04-14 11:25:04 · Tags: apache-spark, hadoop, hive, bigdata

I have a large dataset in a Spark DataFrame. I want to save this data to Hive. Which of the following options will give me the best performance?

Save the data from the Spark DataFrame to HDFS and create a Hive external table on top of it?
Write the data from the Spark DataFrame to a Hive table directly?

Which one will give the best performance, and why?

1 Answer

It's better to write the data from the Spark DataFrame to the Hive table directly.

All data in Hive tables is stored as files in HDFS.

Saving the data to HDFS and then creating a Hive external table on top of it seems to be doing the work twice.
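For comparison, the two-step route could be sketched in PySpark roughly as follows. This is only an illustration under assumptions: the helper names, the Parquet format, the table name, and the path are hypothetical and not from the question, and `spark` is assumed to be a SparkSession created with Hive support enabled.

```python
from __future__ import annotations


def external_table_ddl(table: str, columns: list[tuple[str, str]], location: str) -> str:
    """Build a Hive DDL statement for an external Parquet table (hypothetical helper)."""
    cols = ", ".join(f"`{name}` {dtype}" for name, dtype in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} ({cols}) "
        f"STORED AS PARQUET LOCATION '{location}'"
    )


def save_via_external_table(spark, df, table: str, location: str) -> str:
    """Two-step route: write files to HDFS, then register them in Hive."""
    # Step 1: write the DataFrame as Parquet files under the HDFS path.
    df.write.mode("overwrite").parquet(location)
    # Step 2: declare an external Hive table over those files.
    # df.dtypes yields (column name, type string) pairs, e.g. ("id", "bigint").
    ddl = external_table_ddl(table, df.dtypes, location)
    spark.sql(ddl)
    return ddl
```

Called as `save_via_external_table(spark, df, "db.events", "hdfs:///data/events")`, this issues one write plus one DDL statement, which is the extra step the answer refers to. The type strings from `df.dtypes` are assumed to be acceptable to Hive as-is; complex or unusual types may need manual mapping.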

Spark can also save the contents of a DataFrame directly to a Hive table, provided the Hive table is created with the same schema as the DataFrame, which is a lot easier.
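A minimal PySpark sketch of the direct write might look like the following. The helper name, the `table_exists` flag, and the choice of Parquet and overwrite mode are assumptions for illustration. It relies on two `DataFrameWriter` methods: `insertInto`, which needs an existing table and matches columns by position, and `saveAsTable`, which can create the table from the DataFrame's schema.

```python
from __future__ import annotations


def write_to_hive(df, table: str, table_exists: bool) -> str:
    """Write a DataFrame straight into a Hive table; returns the method used."""
    if table_exists:
        # insertInto requires the table to exist already and matches columns
        # by position, so the DataFrame's column order must match the table's.
        df.write.insertInto(table, overwrite=False)
        return "insertInto"
    # saveAsTable creates the table from the DataFrame's own schema.
    df.write.mode("overwrite").format("parquet").saveAsTable(table)
    return "saveAsTable"
```

In a real job, `df` would come from a session built with `SparkSession.builder.enableHiveSupport().getOrCreate()`, so that the table lands in the Hive metastore rather than in Spark's default catalog.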

How fast Spark writes data from a DataFrame to HDFS or to a Hive table depends on your cluster setup.