
What is the difference between Apache Pig and Apache Hive?

Original post: 2012-04-23 11:47:46 | Tags: hadoop, hive, apache-pig

What is the exact difference between Pig and Hive? I found that both serve the same purpose, because they are used to do the same work. The only thing that differs is the implementation. So when should I use which technology? Is there any specification that clearly shows the difference between the two in terms of applicability and performance?

5 Answers

Apache Pig and Hive are two projects that layer on top of Hadoop and provide a higher-level language for using Hadoop's MapReduce library. Apache Pig provides a scripting language for describing operations like reading, filtering, transforming, joining, and writing data, which are exactly the operations that MapReduce was originally designed for. Rather than expressing these operations in thousands of lines of Java code that uses MapReduce directly, Pig lets users express them in a language not unlike a bash or perl script. Pig is excellent for prototyping and rapidly developing MapReduce-based jobs, as opposed to coding MapReduce jobs in Java itself.

If Pig is "scripting for Hadoop", then Hive is "SQL queries for Hadoop". Apache Hive offers an even more specific and higher-level language for querying data by running Hadoop jobs, rather than directly scripting, step by step, the operation of several MapReduce jobs on Hadoop. The language is, by design, extremely SQL-like. Hive is still intended as a tool for long-running, batch-oriented queries over massive data; it's not "real-time" in any sense. Hive is an excellent tool for analysts and business development types who are accustomed to SQL-like queries and Business Intelligence systems; it will let them easily leverage your shiny new Hadoop cluster to perform ad-hoc queries or generate report data across the data stored on that cluster.

From a purely engineering point of view, I find PIG easier both to write and to maintain than SQL-like languages. It is procedural, so you apply a bunch of relations to your data one by one, and if something fails you can easily debug at intermediate steps. There is even a command called "illustrate" which uses an algorithm to sample some data matching your relation. I'd say for jobs with complex logic this is definitely much more convenient than Hive, but for simple stuff the gain is probably minimal.
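
To make the "one relation at a time" style concrete, here is a rough analogy in plain Python (not Pig Latin, and nothing here runs on Hadoop). The record fields and the threshold are made up for illustration; the point is only that every intermediate result has a name and can be inspected on its own when something goes wrong.

```python
# Plain-Python analogy of a step-by-step data flow. This is NOT Pig Latin;
# the field names and the threshold of 10 are hypothetical.
records = [
    {"user": "ann", "country": "US", "amount": 30},
    {"user": "bob", "country": "DE", "amount": 5},
    {"user": "cid", "country": "US", "amount": 12},
    {"user": "dee", "country": "DE", "amount": 40},
]

# Step 1: filter. The result is a named intermediate we can look at.
large = [r for r in records if r["amount"] >= 10]

# Step 2: group the surviving amounts by country.
grouped = {}
for r in large:
    grouped.setdefault(r["country"], []).append(r["amount"])

# Step 3: aggregate each group.
totals = {country: sum(amounts) for country, amounts in grouped.items()}

# Debugging at an intermediate step is just a matter of printing it.
print(large)
print(grouped)
print(totals)
```

In a single declarative query the filter, grouping, and aggregation would all be folded into one statement, so there is no named intermediate to print.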

Regarding interfacing, I find that PIG offers a lot of flexibility compared to Hive. You don't have a notion of a table in PIG, so you manipulate files directly, and you can very easily load pretty much any format with loader UDFs, without having to go through a table loading stage before you can do your transformations. Recent versions of PIG have a nice feature where you can use dynamic invokers, i.e. use pretty much any Java method directly in your PIG script without having to write a UDF.

For performance/optimization, from what I've seen you can directly control in PIG the type of join and grouping algorithm you want to use (I believe there are 3 or 4 different algorithms for each). I've personally never used it, but when you're writing demanding algorithms it could be useful to decide what to do yourself instead of relying on the optimizer, as is the case in Hive. So I wouldn't say it necessarily performs better than Hive, but in cases where the optimizer makes the wrong decision, you have the option to choose which algorithm to use and have more control over what happens.

One of the cool things I did lately was splits: you can split your execution flow and apply different relations to each split. So you can have a non-linear dataset, split it based on a field, apply different processing to each part, and maybe join the results together at the end, all in the same script. I don't think you can do this in Hive; you'd have to write different queries for each case, but I may be wrong.
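
The split-process-recombine shape described above can be sketched in plain Python as an analogy (again, this is not Pig Latin). The "kind" field and the two per-branch computations are hypothetical and chosen only to show two branches doing different work before being brought back together.

```python
# Plain-Python analogy of splitting a flow on a field, processing each
# branch differently, and combining the results. Field names are made up.
events = [
    {"kind": "click", "value": 3},
    {"kind": "purchase", "value": 20},
    {"kind": "click", "value": 5},
    {"kind": "purchase", "value": 7},
]

# Split the flow on the "kind" field.
clicks = [e for e in events if e["kind"] == "click"]
purchases = [e for e in events if e["kind"] == "purchase"]

# Apply different processing to each branch.
click_summary = {"kind": "click", "count": len(clicks)}
purchase_summary = {"kind": "purchase", "total": sum(e["value"] for e in purchases)}

# Bring the branches back together at the end of the same script.
combined = [click_summary, purchase_summary]
print(combined)
```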

One thing to note also is that you can increment counters in PIG. Currently you can only do this in PIG UDFs though. I don't think you can use counters in Hive.

There are also some nice projects that allow you to interface PIG with Hive (like HCatalog), so you can basically read data from a Hive table, or write data to a Hive table (or both), by simply changing your loader in the script. It supports dynamic partitions as well.

Apache Pig is a platform for analyzing large data sets. Pig's language, Pig Latin, is a simple query algebra that lets you express data transformations such as merging data sets, filtering them, and applying functions to records or groups of records. Users can create their own functions to do special-purpose processing.

Pig Latin queries execute in a distributed fashion on a cluster. Our current implementation compiles Pig Latin programs into Map-Reduce jobs, and executes them using a Hadoop cluster.

https://cwiki.apache.org/confluence/display/PIG/Index%3bjsessionid=F92DF7021837B3DD048BF9529A434FDA

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

https://cwiki.apache.org/Hive/

What is the exact difference between Pig and Hive? I found that both serve the same purpose, because they are used to do the same work.

Have a look at the Pig vs Hive comparison in a nutshell from the dezyre article.

Hive scores over PIG in partitions, server, web interface, and JDBC/ODBC support.

Some differences:

Hive is best for structured data & PIG is best for semi-structured data
Hive is used for reporting & PIG for programming
Hive is used as declarative SQL & PIG is used as a procedural language
Hive supports partitions & PIG does not
Hive can start an optional Thrift-based server & PIG can't
Hive defines tables beforehand (schema) and stores the schema information in a database & PIG doesn't have a dedicated metadata database
Hive does not support Avro but PIG does
Pig also supports an additional COGROUP feature for performing outer joins, which Hive does not have. But both Hive & PIG can join, order & sort dynamically
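
To show what a COGROUP-style result looks like and why an outer join can be built from it, here is a small plain-Python sketch. It is an analogy only: the `cogroup` helper, the `users` and `orders` lists, and the `id` key are hypothetical names for this illustration, not part of Pig or Hive.

```python
from collections import defaultdict

def cogroup(left, right, key):
    """Group two lists of dicts by a shared key.

    Returns {key_value: (rows_from_left, rows_from_right)}. A key that
    appears on only one side still gets an entry, with an empty list for
    the other side, which is the information an outer join needs.
    """
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[row[key]][0].append(row)
    for row in right:
        groups[row[key]][1].append(row)
    return dict(groups)

# Hypothetical sample data.
users = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
orders = [{"id": 1, "item": "book"}, {"id": 3, "item": "pen"}]

result = cogroup(users, orders, "id")
for k in sorted(result):
    print(k, result[k])
```

Key 2 has a user but no orders and key 3 has an order but no user, so each keeps an empty list on the missing side instead of being dropped.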

So when to use and which technology?

The differences above should answer your question.

HIVE: structured data, SQL-like queries, used for reporting purposes.

PIG: semi-structured data, programming a workflow involving a sequence of activities for Map Reduce jobs.

Regarding job performance, both HIVE and PIG are slow compared to a traditional Map Reduce job. Reason: in the end, Hive or PIG scripts have to be converted into a series of Map Reduce jobs.

Have a look at this related SE question:

Pig vs Hive vs Native Map Reduce

The main difference is that PIG is a data flow language and Hive is a data warehouse. PIG can be used like a step-by-step procedural language, whereas HIVE is used as a declarative language. PIG can be used to ingest online streaming unstructured data, but HIVE can only access structured data; it can also access data from RDBMS databases such as SQL, NOSQL by using JDBC and ODBC drivers. PIG can convert data into Avro format but HIVE can't. PIG can't create partitions but HIVE can. Because HIVE sits on top of PIG, HIVE can only access the data once it has been processed by PIG. When to use PIG or HIVE depends on the data: if you are working with structured, relational data, use HIVE; otherwise use PIG. With PIG we can communicate with ETL tools, but it takes more time compared with HIVE. On the other hand, it is easier in PIG than in HIVE, because in HIVE we have to create a table before processing the data.