
How does Hive CLI retrieve huge result files from HDFS?

After I execute a Hive query via the CLI like below:

$ hive -e QUERY > output.txt
  1. The Hive client compiles the QUERY and sends it to the Hadoop cluster.
  2. Hadoop executes some jobs and writes the result to a file on HDFS (assume only 1 reducer).
  3. The Hive client then retrieves this single file, extracts it, and writes it to local STDOUT.

The flow looks like the diagram below:

==============
Hadoop Cluster
==============
  |         |
  |         |
  |     2. output RESULT as a single .gz file at HDFS because of 1 reducer
  |         |
  |         |
1. QUERY    |
  |         |
  |     3. Hive retrieves the RESULT as stream or a whole file ?
  |        If as a whole file, what happens when file size > memory size ?
  |         |
  |         |
  ===========
  Hive Client
  ===========
      |
      |
  4. Client outputs RESULT to stdout which is redirected to a file
      |
      |
 ===========
 Output File
 ===========

My question is: if the single result file on HDFS is very large, even larger than my local physical memory, how does the Hive client handle it?

Does the Hive client retrieve the file

  1. as a stream?
  2. by putting it into some temporary swap file?
  3. or in some other way?

You get the results as a stream, so unless you redirect the output, no temporary files are involved in the process. You can think of it as running hadoop fs -cat /THE/RESULT/FILE/OF/YOUR/HIVE/REQUEST
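To illustrate why streaming keeps memory use independent of file size, here is a minimal Python sketch. It is not Hive's actual implementation; `stream_copy` and the in-memory stand-in for the remote file are hypothetical names used only for illustration. The point is that only one chunk is held in memory at a time, however large the source is.

```python
import io

def stream_copy(source, sink, chunk_size=64 * 1024):
    """Copy source to sink chunk by chunk; return the total bytes copied."""
    total = 0
    while True:
        chunk = source.read(chunk_size)  # at most chunk_size bytes held at once
        if not chunk:                    # an empty read means end of stream
            break
        sink.write(chunk)
        total += len(chunk)
    return total

# Stand-in for a large result file: 1 MiB of data, copied 4 KiB at a time.
remote = io.BytesIO(b"x" * (1024 * 1024))
local = io.BytesIO()
copied = stream_copy(remote, local, chunk_size=4096)
```

With real file objects in place of the `BytesIO` buffers, the same loop would copy a file larger than physical memory, since the working set is bounded by `chunk_size`.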

If the result is large, you could pipe it straight back to an HDFS location:

$ hive -e QUERY | hadoop fs -put - /HDFS/LOCATION

But pay attention to the network here, as it might become saturated: the results travel from the cluster to your client and then back to the cluster.

Another alternative is to store the data immediately in another Hive table. That way Hive does all the work for you, and no results are streamed or copied to your local machine.
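As a rough sketch of the table-based alternative, the snippet below builds a `hive -e` command that wraps a query in a `CREATE TABLE ... AS SELECT` statement. The helper `ctas_command` and the names `query_result` and `source_table` are hypothetical, and the sketch assumes QUERY is a plain SELECT statement; it does no escaping or validation of its inputs.

```python
def ctas_command(table_name, select_query):
    """Build an argv list for `hive -e` that materializes a query as a table."""
    # Assumption: select_query is a single SELECT statement without a trailing ';'.
    statement = f"CREATE TABLE {table_name} AS {select_query}"
    return ["hive", "-e", statement]

# Hypothetical table names, used only for illustration.
cmd = ctas_command("query_result", "SELECT * FROM source_table")

# To actually run it on a machine with the Hive CLI installed:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Because the result stays inside the cluster, only the short status output of the command reaches the client.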
