
Pig vs Hive vs Native Map Reduce


I have a basic understanding of what the Pig and Hive abstractions are, but I don't have a clear idea of the scenarios that require Hive, Pig, or native MapReduce.

I went through a few articles that basically point out that Hive is for structured processing and Pig is for unstructured processing. When do we need native MapReduce? Can you point out a few scenarios that can't be solved using Pig or Hive but can be solved in native MapReduce?

Complex branching logic with a lot of nested if .. else .. structures is easier and quicker to implement in standard MapReduce. For processing structured data you could use Pangool, which also simplifies things like JOIN. Standard MapReduce also gives you full control to minimize the number of MapReduce jobs that your data processing flow requires, which translates into performance. But it takes more time to code and to introduce changes.

Apache Pig is good for structured data too, but its advantage is the ability to work with BAGs of data (all rows that are grouped on a key). That makes it simpler to implement things like:

  1. Get the top N elements for each group;
  2. Calculate a total for each group and then put that total against each row in the group;
  3. Use Bloom filters for JOIN optimisations;
  4. Multiquery support (this is when Pig tries to minimise the number of MapReduce jobs by doing more work in a single job).
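To make the first two items concrete, here is a minimal plain-Python sketch of the "bag per key" idea. It is not Pig Latin and does not use any Pig API; the function names and the `sales` sample data are made up for illustration.

```python
from collections import defaultdict
from heapq import nlargest


def group_by_key(rows):
    """Collect all values that share a key, roughly like a Pig bag."""
    bags = defaultdict(list)
    for key, value in rows:
        bags[key].append(value)
    return bags


def top_n_per_group(rows, n):
    """Return the n largest values for each key."""
    return {key: nlargest(n, values) for key, values in group_by_key(rows).items()}


def rows_with_group_total(rows):
    """Attach the per-group total to every row of that group."""
    totals = {key: sum(values) for key, values in group_by_key(rows).items()}
    return [(key, value, totals[key]) for key, value in rows]


# Hypothetical sample data: (region, amount) pairs.
sales = [("east", 5), ("west", 7), ("east", 9), ("east", 2), ("west", 1)]
top_two = top_n_per_group(sales, 2)
with_totals = rows_with_group_total(sales)
```

In hand-written MapReduce both operations need the whole group available in the reducer, which is the part Pig's bags give you for free.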

Hive is better suited for ad-hoc queries, but its main advantage is that it has an engine that stores and partitions data. Its tables can still be read from Pig or standard MapReduce.

One more thing: Hive and Pig are not well suited to working with hierarchical data.

Short answer: we need MapReduce when we need very deep, fine-grained control over the way we process our data. Sometimes it is not very convenient to express exactly what we need in terms of Pig and Hive queries.

Whatever you can do with MapReduce should not be totally impossible to do through Pig or Hive. With the level of flexibility that Pig and Hive provide you can somehow manage to achieve your goal, but it might not be that smooth. You could write UDFs, for example, to get there.

There is no clear-cut distinction among the usage of these tools. It totally depends on your particular use case. Based on your data and the kind of processing you need, you have to decide which tool fits your requirements better.

Edit:

Some time ago I had a use case in which I had to collect seismic data and run some analytics on it. The format of the files holding this data was somewhat weird. Part of the data was EBCDIC encoded, while the rest was in binary format. It was basically a flat binary file with no delimiters like \n or anything similar. I had a tough time finding a way to process these files using Pig or Hive. As a result I had to settle for MR. Initially it took time, but gradually it became smoother, as MR is really swift once you have the basic template ready.
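To illustrate why a delimiter-free, mixed EBCDIC/binary file needs record-level control, here is a small sketch of fixed-length record parsing in plain Python. The record layout (an 8-byte EBCDIC label followed by a big-endian int32 and float32) is purely hypothetical and is not the actual seismic format; `cp037` is one of the EBCDIC code pages shipped with Python.

```python
import io
import struct

HEADER_BYTES = 8        # hypothetical: 8-byte EBCDIC text label
PAYLOAD_FORMAT = ">if"  # hypothetical: big-endian int32 + float32
RECORD_BYTES = HEADER_BYTES + struct.calcsize(PAYLOAD_FORMAT)


def read_records(stream):
    """Yield (label, trace_id, amplitude) from a delimiter-free binary stream."""
    while True:
        chunk = stream.read(RECORD_BYTES)
        if len(chunk) < RECORD_BYTES:
            break  # end of data; a trailing partial record is dropped
        label = chunk[:HEADER_BYTES].decode("cp037").rstrip()
        trace_id, amplitude = struct.unpack(PAYLOAD_FORMAT, chunk[HEADER_BYTES:])
        yield label, trace_id, amplitude


# Build two demo records in memory instead of reading a real file.
demo = "LINE01".ljust(HEADER_BYTES).encode("cp037") + struct.pack(PAYLOAD_FORMAT, 7, 1.5)
records = list(read_records(io.BytesIO(demo * 2)))
```

In a MapReduce job this kind of logic would typically live in a custom InputFormat/RecordReader, so that each mapper receives whole records rather than text lines.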

So, like I said earlier, it basically depends on your use case. For example, iterating over each record of your dataset is really easy in Pig (just a foreach), but what if you need a foreach n? When you need "that" level of control over the way you process your data, MR is more suitable.

Another situation might be when your data is hierarchical rather than row-based, or when your data is highly unstructured.

Metapattern problems involving job chaining and job merging are easier to solve using MR directly rather than using Pig/Hive.
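As a rough illustration of job chaining, the sketch below simulates two MapReduce passes in plain Python, where the output of the first job is the input of the second. The `run_job` helper and the mapper/reducer names are hypothetical stand-ins, not Hadoop classes.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Simulate one MapReduce pass: map, shuffle by key, then reduce."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            shuffled[key].append(value)
    output = []
    for key in sorted(shuffled):
        output.extend(reducer(key, shuffled[key]))
    return output


def count_words_map(line):
    return [(word, 1) for word in line.split()]


def sum_reduce(key, values):
    return [(key, sum(values))]


def frequency_map(word_count):
    # Second job: key on the count itself to see how many words share it.
    _, count = word_count
    return [(count, 1)]


lines = ["pig hive pig", "hive mr"]
word_counts = run_job(lines, count_words_map, sum_reduce)       # job 1
histogram = run_job(word_counts, frequency_map, sum_reduce)     # job 2, chained
```

With MR written by hand, the driver decides exactly how many such passes run and what is handed from one to the next.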

And sometimes it is much more convenient to accomplish a particular task using some xyz tool than to do it using Pig/Hive. IMHO, MR turns out to be better in such situations as well. For example, if you need to do some statistical analysis on your big data, R used with Hadoop Streaming is probably the best option to go with.
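Hadoop Streaming lets any program that reads lines and writes tab-separated key/value lines act as a mapper or reducer, which is what makes R (or any other language) usable. The sketch below shows that contract in Python rather than R; the "station,reading" input format is an assumption, and the `sorted(...)` call only stands in locally for the framework's shuffle.

```python
from itertools import groupby


def mapper(lines):
    """Turn 'station,reading' input lines into 'station<TAB>reading' pairs."""
    for line in lines:
        station, reading = line.strip().split(",")
        yield f"{station}\t{reading}"


def reducer(sorted_lines):
    """Average the readings per station; input must arrive sorted by key."""
    pairs = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for station, group in groupby(pairs, key=lambda pair: pair[0]):
        readings = [float(value) for _, value in group]
        yield f"{station}\t{sum(readings) / len(readings)}"


# Local stand-in for the shuffle: sort the mapper output by key.
raw = ["a,1.0", "b,4.0", "a,3.0"]
result = list(reducer(sorted(mapper(raw))))
```

In a real Streaming job the two functions would be separate scripts reading from standard input and printing to standard output.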

HTH

MapReduce:

Strengths:
      works both on structured and unstructured data.
      good for writing complex business logic.

Weakness:
     long development time
     hard to achieve join functionality

Hive:

Strengths:
     less development time.
     suitable for adhoc analysis.
     easy for joins

Weakness :
     not easy for complex business logic.
     deals only with structured data.

Pig

Strengths :
      Structured and unstructured data.
      joins are easily written.

Weakness:
     new language to learn.
     scripts are converted into MapReduce jobs.

Hive

Pros:

SQL-like; database people love that. Good support for structured data. Currently supports database schemas and view-like structures. Supports concurrent multi-user, multi-session scenarios. Bigger community support: Hive, Hive Server, Hive Server2, Impala, and Sentry already exist.

Cons: Performance degrades as data grows bigger, and there is not much you can do about it; memory overflow issues. Hierarchical data is a challenge. Unstructured data requires UDF-like components. A combination of multiple techniques could be a nightmare, e.g. dynamic partitions with UDTFs in the case of big data.

Pig: Pros: Great script-based data flow language.

Cons:

Unstructured data requires UDF-like components. Not much community support.

MapReduce: Pros: I don't agree with "hard to achieve join functionality"; if you understand what kind of join you want to implement, you can implement it with a few lines of code. Most of the time MR yields better performance. MR support for hierarchical data is great, especially for implementing tree-like structures. Better control over partitioning / indexing the data. Job chaining.

Cons: You need to know the API very well to get better performance, etc. More code to write, debug, and maintain.
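As a rough illustration of the "few lines of code" claim about joins, here is a reduce-side inner join simulated in plain Python: each mapper tags its records with their source, and the reducer pairs up values that share a key. The `users`/`orders` datasets and all function names are hypothetical; this is a sketch of the pattern, not Hadoop API code.

```python
from collections import defaultdict


def map_users(record):
    user_id, name = record
    return user_id, ("U", name)      # tag the value with its source


def map_orders(record):
    user_id, amount = record
    return user_id, ("O", amount)


def reduce_join(key, tagged_values):
    """Inner join: pair every user name with every order for the same key."""
    names = [value for tag, value in tagged_values if tag == "U"]
    amounts = [value for tag, value in tagged_values if tag == "O"]
    return [(key, name, amount) for name in names for amount in amounts]


def run_join(users, orders):
    shuffled = defaultdict(list)     # stands in for the shuffle phase
    for key, value in map(map_users, users):
        shuffled[key].append(value)
    for key, value in map(map_orders, orders):
        shuffled[key].append(value)
    joined = []
    for key in sorted(shuffled):
        joined.extend(reduce_join(key, shuffled[key]))
    return joined


users = [(1, "ann"), (2, "bob")]
orders = [(1, 30), (1, 12), (3, 99)]
joined = run_join(users, orders)
```

Changing `reduce_join` to emit rows when one side is empty would turn this into an outer join, which is the kind of choice the answer means by "what kind of join you want".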

Scenarios where Hadoop MapReduce is preferred to Hive or Pig

  1. When you need definite driver program control
  2. Whenever the job requires implementing a custom Partitioner
  3. If there already exists a pre-defined library of Java Mappers or Reducers for the job
  4. If you require a good amount of testability when combining lots of large data sets
  5. If the application demands legacy code requirements that command physical structure
  6. If the job requires optimization at a particular stage of processing by making the best use of tricks like in-mapper combining
  7. If the job has some tricky usage of distributed cache (replicated join), cross products, groupings or joins
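Two of the items above, in-mapper combining and a custom partitioner, can be sketched in plain Python as follows. This only mimics the ideas; the function names, the "group:detail" composite key format, and the use of `zlib.crc32` as the hash are assumptions made for the example, not Hadoop's Partitioner interface.

```python
import zlib
from collections import Counter


def map_with_in_mapper_combining(lines):
    """Aggregate counts inside the mapper and emit once per distinct word,
    instead of emitting (word, 1) for every occurrence."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return sorted(counts.items())


def partition(composite_key, num_reducers):
    """Custom partitioner sketch: hash only the part before ':' so that all
    keys sharing that prefix are routed to the same reducer."""
    group = composite_key.split(":", 1)[0]
    return zlib.crc32(group.encode("utf-8")) % num_reducers


split = ["pig hive pig", "hive pig mr"]
emitted = map_with_in_mapper_combining(split)
same_reducer = partition("us:nyc", 4) == partition("us:sf", 4)
```

The six input words shrink to three emitted pairs before anything crosses the network, which is the saving in-mapper combining is after.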

Comparison between MapReduce / Pig / Hive

Pros of Pig/Hive:

  1. Hadoop MapReduce requires more development effort than Pig and Hive.
  2. Pig and Hive coding approaches are slower than a fully tuned Hadoop MapReduce program.
  3. When using Pig and Hive for executing jobs, Hadoop developers need not worry about any version mismatch.
  4. There is very limited possibility for the developer to write Java-level bugs when coding in Pig or Hive.

Have a look at this post for a Pig vs Hive comparison.

Everything we can do using Pig and Hive can be achieved using MR (though sometimes it will be time-consuming). Pig and Hive use MR/Spark/Tez underneath. The reverse does not hold: something MR can do may or may not be possible in Hive and Pig.

Here is a great comparison. It specifies all the use case scenarios.
