Pig vs Hive vs Native MapReduce
I have a basic understanding of what the Pig and Hive abstractions are, but I don't have a clear idea of the scenarios that call for Hive, Pig, or native MapReduce.
I went through a few articles which basically point out that Hive is for structured processing and Pig is for unstructured processing. When do we need native MapReduce? Can you point out a few scenarios that can't be solved using Pig or Hive but can be solved in native MapReduce?
Complex branching logic with many nested if .. else .. structures is easier and quicker to implement in standard MapReduce. For processing structured data you could use Pangool, which also simplifies things like JOIN. Standard MapReduce also gives you full control to minimize the number of MapReduce jobs that your data processing flow requires, which translates into performance. But it takes more time to code and to introduce changes.
Apache Pig is good for structured data too, but its advantage is the ability to work with BAGs of data (all rows that are grouped on a key), which makes per-group operations simpler to implement.
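To make the BAG idea concrete, here is a minimal Python sketch (not Pig Latin) that groups rows on a key and then works on each group as a whole. The sample rows, the helper names `group_into_bags` and `top_n_per_group`, and the choice of "top N per group" as the per-group operation are my own assumptions for illustration, not something taken from the answer.

```python
from collections import defaultdict

def group_into_bags(rows):
    """Collect all (key, value) rows that share a key into one list,
    roughly what grouping on a key gives you as a bag."""
    bags = defaultdict(list)
    for key, value in rows:
        bags[key].append(value)
    return dict(bags)

def top_n_per_group(rows, n):
    """Keep the n largest values of every bag (hypothetical per-group task)."""
    return {key: sorted(bag, reverse=True)[:n]
            for key, bag in group_into_bags(rows).items()}

rows = [("a", 3), ("b", 7), ("a", 9), ("a", 1), ("b", 2)]
print(top_n_per_group(rows, 2))  # {'a': [9, 3], 'b': [7, 2]}
```

Once the whole group is available as a single collection, the per-group step is an ordinary function over a list, which is the convenience the bag abstraction offers.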
Hive is better suited to ad-hoc queries, but its main advantage is that it has an engine that stores and partitions data. Its tables can still be read from Pig or standard MapReduce.
One more thing: Hive and Pig are not well suited to working with hierarchical data.
Short answer: we need MapReduce when we need very deep, fine-grained control over the way we process our data. Sometimes it is not very convenient to express exactly what we need in terms of Pig and Hive queries.
It should not be totally impossible to do through Pig or Hive what you can do using MapReduce. With the flexibility Pig and Hive provide you can usually manage to achieve your goal, but it might not be that smooth. You could write UDFs, for example, to get there.
There is no clear-cut distinction among the usage of these tools. It depends entirely on your particular use case. Based on your data and the kind of processing you need, you have to decide which tool fits your requirements better.
Edit:
Some time ago I had a use case in which I had to collect seismic data and run some analytics on it. The format of the files holding this data was somewhat weird: part of the data was EBCDIC-encoded, while the rest was in binary format. It was basically a flat binary file with no delimiters such as \n. I had a tough time finding a way to process these files using Pig or Hive, so I had to settle for MR. Initially it took time, but it gradually became smoother, as MR is really swift once you have the basic template ready.
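The kind of record handling such a file needs can be sketched in a few lines of Python. This is only an illustration of the idea behind a custom record reader: the record layout (an 8-byte EBCDIC label followed by four big-endian 32-bit integers) is entirely hypothetical, and `cp037` is just one of several EBCDIC code pages, chosen here as an assumption.

```python
import struct

# Hypothetical fixed-width layout: nothing here comes from a real seismic format.
HEADER_BYTES = 8                      # EBCDIC text label
SAMPLE_COUNT = 4                      # big-endian signed 32-bit integers
RECORD_BYTES = HEADER_BYTES + 4 * SAMPLE_COUNT

def read_records(blob):
    """Split an undelimited byte string into fixed-width records and decode each."""
    if len(blob) % RECORD_BYTES != 0:
        raise ValueError("input is not a whole number of records")
    for offset in range(0, len(blob), RECORD_BYTES):
        record = blob[offset:offset + RECORD_BYTES]
        # The text part is EBCDIC, so decode it with an EBCDIC code page.
        label = record[:HEADER_BYTES].decode("cp037").rstrip()
        # The rest is raw binary, so unpack it by position.
        samples = struct.unpack(f">{SAMPLE_COUNT}i", record[HEADER_BYTES:])
        yield label, samples

# Build one sample record and read it back.
blob = "TRACE01 ".encode("cp037") + struct.pack(">4i", 1, -2, 3, 4)
print(list(read_records(blob)))  # [('TRACE01', (1, -2, 3, 4))]
```

Because there are no delimiters, the reader has to slice by byte offsets, which is the part that line-oriented loaders do not handle out of the box.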
So, like I said earlier, it basically depends on your use case. For example, iterating over each record of your dataset is really easy in Pig (just a foreach), but what if you need foreach n? When you need "that" level of control over the way you process your data, MR is more suitable.
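One possible reading of "foreach n" is handling records n at a time instead of one at a time. Under that assumption (the answer does not spell it out), a minimal Python sketch of the buffering involved looks like this; `for_each_n` is a hypothetical helper name.

```python
def for_each_n(records, n):
    """Yield consecutive batches of n records; the last batch may be shorter."""
    if n <= 0:
        raise ValueError("n must be positive")
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == n:
            yield batch
            batch = []          # start a fresh batch
    if batch:
        yield batch             # flush whatever is left at the end

print(list(for_each_n(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```

In a hand-written mapper the same pattern would mean keeping the buffer as state across calls and flushing it when the input ends, which is the sort of control over the processing loop the answer is referring to.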
Another situation might be when your data is hierarchical rather than row-based, or when it is highly unstructured.
Metapattern problems involving job chaining and job merging are easier to solve using MR directly than using Pig/Hive.
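Job chaining simply means feeding the output of one job into the next. The following is a small in-memory Python simulation, not Hadoop code: `run_job` is a hypothetical stand-in for one map, shuffle and reduce pass, and the two jobs (a word count followed by grouping words by their count) are examples I picked for illustration.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Run one in-memory map -> shuffle -> reduce pass and return its output."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)      # shuffle: group values by key
    output = []
    for key in sorted(groups):             # reducers see keys in sorted order
        output.extend(reducer(key, groups[key]))
    return output

# Job 1: classic word count.
def split_words(line):
    for word in line.split():
        yield word, 1

def sum_counts(word, counts):
    yield word, sum(counts)

# Job 2: consumes job 1's output and groups words by how often they occur.
def key_by_count(pair):
    word, count = pair
    yield count, word

def collect_words(count, words):
    yield count, sorted(words)

lines = ["pig hive pig", "hive mr pig"]
counts = run_job(lines, split_words, sum_counts)
histogram = run_job(counts, key_by_count, collect_words)
print(counts)     # [('hive', 2), ('mr', 1), ('pig', 3)]
print(histogram)  # [(1, ['mr']), (2, ['hive']), (3, ['pig'])]
```

When you write the driver yourself you decide exactly how many such passes run and what each one reads, which is where the extra control comes from.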
And sometimes it is far more convenient to accomplish a particular task using some xyz tool than to do it using Pig/Hive. IMHO, MR turns out to be better in such situations as well. For example, if you need to do some statistical analysis on your big data, R used with Hadoop Streaming is probably the best option to go with.
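Hadoop Streaming lets any program act as mapper or reducer as long as it reads lines and writes tab-separated key/value lines, with the reducer receiving its input sorted by key. The answer suggests R; the sketch below uses Python instead, purely to show that contract. The input format (`station,reading` CSV lines) and the per-key mean are assumptions made for the example, and the functions take iterables of lines so they can be tried without a cluster.

```python
from itertools import groupby

def mapper(lines):
    """Turn 'station,reading' lines into 'station<TAB>reading' lines."""
    for line in lines:
        station, reading = line.strip().split(",")
        yield f"{station}\t{reading}"

def reducer(sorted_lines):
    """Average the readings per key; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        values = [float(value) for _, value in group]
        yield f"{key}\t{sum(values) / len(values)}"

# Local stand-in for the framework: map, sort by key, reduce.
raw = ["a,2", "b,5", "a,4"]
print(list(reducer(sorted(mapper(raw)))))  # ['a\t3.0', 'b\t5.0']
```

In a real streaming job each function would be wrapped in a small script that reads `sys.stdin` and prints to standard output; the statistical part in the middle is where a tool like R would do its work.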
HTH
MapReduce:
Strengths:
works on both structured and unstructured data.
good for writing complex business logic.
Weaknesses:
long development time.
hard to achieve join functionality.
Hive:
Strengths:
less development time.
suitable for ad hoc analysis.
easy for joins.
Weaknesses:
not easy for complex business logic.
deals only with structured data.
Pig:
Strengths:
works on both structured and unstructured data.
joins are easily written.
Weaknesses:
new language to learn.
scripts are converted into MapReduce jobs.
Hive
Pros:
SQL-like; database people love that.
Good support for structured data.
Currently supports database schemas and view-like structures.
Supports concurrent multi-user, multi-session scenarios.
Bigger community support: Hive, Hive Server, HiveServer2, Impala and Sentry already exist.
Cons:
Performance degrades as data grows bigger, with memory overflow issues, and there is not much you can do about it.
Hierarchical data is a challenge.
Unstructured data requires a UDF-like component.
Combining multiple techniques (dynamic partitions with a UDTF on big data, for example) could be a nightmare.
Pig
Pros:
Great script-based data flow language.
Cons:
Unstructured data requires a UDF-like component.
Not much community support.
MapReduce
Pros:
I don't agree with "hard to achieve join functionality": if you understand what kind of join you want to implement, you can implement it in a few lines of code.
Most of the time MR yields better performance.
MR support for hierarchical data is great, especially for implementing tree-like structures.
Better control over partitioning / indexing the data.
Job chaining.
Cons:
You need to know the API very well to get better performance, etc.
Code / debug / maintain.
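As a rough illustration of the "few lines of code" claim about joins, here is an in-memory Python sketch of a reduce-side inner join: records are tagged with their source in the map step, grouped by key in the shuffle, and paired up in the reduce step. It is a simulation rather than Hadoop API code, and the `users` and `orders` data are made up.

```python
from collections import defaultdict
from itertools import product

def reduce_side_join(left, right):
    """Inner-join two lists of (key, value) pairs the way a reduce-side join does."""
    # Map: tag every record with the side it came from.
    tagged = [(key, ("L", value)) for key, value in left]
    tagged += [(key, ("R", value)) for key, value in right]
    # Shuffle: bring all records with the same key together.
    groups = defaultdict(lambda: {"L": [], "R": []})
    for key, (side, value) in tagged:
        groups[key][side].append(value)
    # Reduce: emit every left/right combination for each key.
    joined = []
    for key in sorted(groups):
        for l_value, r_value in product(groups[key]["L"], groups[key]["R"]):
            joined.append((key, l_value, r_value))
    return joined

users = [(1, "ann"), (2, "bob"), (3, "cy")]
orders = [(1, "book"), (1, "pen"), (3, "lamp"), (4, "desk")]
print(reduce_side_join(users, orders))
# [(1, 'ann', 'book'), (1, 'ann', 'pen'), (3, 'cy', 'lamp')]
```

Keys that appear on only one side produce an empty product and drop out, which is what makes this an inner join; an outer join would need only a small change in the reduce step.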
Scenarios where Hadoop MapReduce is preferred to Hive or Pig:
When you need definite driver program control
Whenever the job requires implementing a custom Partitioner
If there already exists a pre-defined library of Java Mappers or Reducers for the job
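A partitioner decides which reducer receives each key. The sketch below shows the idea in Python with a hypothetical `range_partition` function that routes keys by their first letter, so each reducer ends up with one alphabetical range. In Hadoop the equivalent logic would live in a Java `Partitioner` subclass's `getPartition` method; the letter-range rule itself is my own example.

```python
def range_partition(key, num_reducers):
    """Return the reducer index for a string key, based on its first letter."""
    first = key[:1].lower()
    if not ("a" <= first <= "z"):
        return 0                      # empty or non-alphabetic keys go to reducer 0
    # Scale the letter position (0..25) onto 0..num_reducers-1.
    return (ord(first) - ord("a")) * num_reducers // 26

for key in ["apple", "mango", "zebra", "42"]:
    print(key, range_partition(key, 3))
# apple 0
# mango 1
# zebra 2
# 42 0
```

With a rule like this, concatenating the reducers' outputs in index order keeps the first letters in alphabetical order, something a plain hash-based partitioning would not give you.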
Pros of Pig/Hive:
Have a look at this post for a Pig vs Hive comparison.
Everything we can do using Pig and Hive can be achieved using MR (though it will sometimes be time-consuming). Pig and Hive use MR/Spark/Tez underneath. So not everything that MR can do is necessarily possible in Hive and Pig.