
Difference between Hive, Pig, and MapReduce use cases

Original post: 2014-10-29 15:23:48 · Tags: hadoop, mapreduce, hive, apache-pig

Difference between MapReduce, Hive, and Pig:

Pig: It's a data flow language. It can work on any data and is basically used to convert semi-structured or unstructured data into structured data, so that it can be used in Hive for advanced analytics with windowing functions, etc.

Hive: Works on structured data and provides a SQL-like query language.

I know that at the back end both Pig and Hive use MapReduce.

I know MapReduce can be a good tool for programmers, and Hive or Pig for SQL people.

I just want to know whether there are any specific use cases where we go for Hive, Pig, or MapReduce.

Basically, how do we decide that we have to use Pig here, Hive there, or that we must use MapReduce?

4 Answers

MapReduce: Has better performance than Pig or Hive, but requires more development time.

Pig: Less development time, but poorer performance compared to MapReduce.

Hive: A SQL-like language with some good features, such as partitioning and bucketing, that improve read performance. Also, Hive enforces schema on read.
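To illustrate why partitioning and bucketing help reads, here is a minimal Python sketch of the idea, not Hive's actual storage code. The column names (`dt`, `user_id`), the bucket count, and the use of CRC32 as the hash are all assumptions made for the example; Hive uses its own hash function and stores partitions as directories.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count chosen for this example


def bucket_for(user_id, num_buckets=NUM_BUCKETS):
    """Pick a bucket with a stable hash (CRC32 here, an assumption) modulo the bucket count."""
    return zlib.crc32(user_id.encode("utf-8")) % num_buckets


def layout(rows):
    """Arrange rows as {partition value: {bucket number: [rows]}}."""
    table = defaultdict(lambda: defaultdict(list))
    for row in rows:
        table[row["dt"]][bucket_for(row["user_id"])].append(row)
    return table


def read_partition(table, dt):
    """Partition pruning: only rows stored under the requested dt are touched."""
    return [row for bucket in table.get(dt, {}).values() for row in bucket]


rows = [
    {"dt": "2014-10-28", "user_id": "u1", "clicks": 3},
    {"dt": "2014-10-29", "user_id": "u2", "clicks": 5},
    {"dt": "2014-10-29", "user_id": "u1", "clicks": 1},
]
table = layout(rows)
pruned = read_partition(table, "2014-10-29")
print(len(pruned), "rows read instead of", len(rows))
```

A query that filters on the partition column skips every other partition, and rows with the same `user_id` always land in the same bucket, which is what makes sampling and bucketed joins cheaper.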

Pig is used to format your unstructured or semi-structured data. Let's say you have a timestamp in your data that does not match the Hive timestamp format. You can convert it using a Pig UDF and format your data. This is just one example; you can do many more things with Pig.
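As a rough sketch of the kind of conversion logic such a UDF might contain, the Python function below rewrites a timestamp into the `yyyy-MM-dd HH:mm:ss` text layout that Hive expects. The input layout (`MM/dd/yyyy HH:mm:ss`) and the function name are assumptions for illustration; registering the function with Pig (for example as a Jython UDF) is left out.

```python
from datetime import datetime

SOURCE_FORMAT = "%m/%d/%Y %H:%M:%S"  # assumed layout of the incoming data
HIVE_FORMAT = "%Y-%m-%d %H:%M:%S"    # yyyy-MM-dd HH:mm:ss, as Hive expects


def to_hive_timestamp(raw):
    """Convert one raw timestamp string; return None if it cannot be parsed."""
    if raw is None:
        return None
    try:
        parsed = datetime.strptime(raw.strip(), SOURCE_FORMAT)
    except ValueError:
        # Unparseable values become None so a bad row does not fail the whole job.
        return None
    return parsed.strftime(HIVE_FORMAT)


print(to_hive_timestamp("10/29/2014 15:23:48"))
```

Returning `None` for bad input is a design choice for this sketch; a real pipeline might prefer to route such rows to a separate output for inspection.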

Hive is basically used for structured data. It may not work well with unstructured data. Queries take more time to execute because they are converted into MapReduce jobs. I suggest you use Impala, which is much faster than Hive.

Pig is a data flow language. This means that you cannot use if statements or loops. If you need to do a lot of repetition, it would be preferable to learn MapReduce.

You can get around this by embedding Pig in a Python script, but this would take even longer, since it would have to load all the jar files on every iteration of the loop.

Basically, it boils down to how much time you spend prototyping versus how much production work you have. If you are a data scientist or an analyst, most of your work is new projects that require a lot of prototyping. This means that you care about getting results fast, so you would prefer Pig or Hive. If you are on a development team, you want to build robust code based on an agreed-upon methodology that does not need to be tested, and then you would prefer MapReduce.

There are companies like Cloudera that provide a package of Pig, Hive, and other Hadoop tools, so you wouldn't have to choose between the two.

MapReduce is an internal component of Hadoop, while Pig and Hive are part of the Hadoop ecosystem, meaning they run on top of Hadoop. MapReduce, Pig, and Hive all serve to process vast amounts of data, each in a different manner.

MapReduce: Implemented by Apache. Highly recommended for processing the entire data set. It is time-consuming and requires programming skills in Java (highly recommended), Python, Ruby, or other programming languages. The data is aggregated and sorted using mapper and reducer functions. Hadoop uses it by default.
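The mapper and reducer roles can be sketched in plain Python as a single-process word count. This only imitates the map, shuffle/sort, and reduce phases in memory; it is not Hadoop code, and the function names are made up for the example.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def reducer(key, values):
    """Reduce phase: aggregate all counts seen for one key."""
    return key, sum(values)


def run_job(lines):
    """Run map, then sort by key (standing in for shuffle/sort), then reduce."""
    mapped = [pair for line in lines for pair in mapper(line)]
    mapped.sort(key=itemgetter(0))
    return [
        reducer(key, (count for _, count in group))
        for key, group in groupby(mapped, key=itemgetter(0))
    ]


result = run_job(["pig hive pig", "hive mapreduce pig"])
print(result)  # [('hive', 2), ('mapreduce', 1), ('pig', 3)]
```

Even this toy version shows where the development time goes: the programmer writes the key/value logic by hand, whereas Hive or Pig would generate the equivalent job from a one-line GROUP BY.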

Hive: Implemented by Facebook. Most analysts, especially big data analysts, use this tool to analyze data, especially structured data. At the back end, Hive uses MapReduce for processing. Internally, Hive uses a special language called HQL, which is a subset of SQL. Anyone well versed in SQL can go with Hive. It is highly recommended for data-warehouse-oriented projects. It is much harder to use for processing unstructured, and especially schema-less, data.

Pig: Pig is a scripting language, implemented by Yahoo. The main difference between Pig and Hive is that Pig can process any type of data, either structured or unstructured. This means it is highly recommended for streaming data such as satellite-generated data, live events, schema-less data, etc. Pig first loads the data; later, the programmer writes a program, depending on the data, to make it structured. Those who are expert in programming languages will choose this part of the Hadoop ecosystem.
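The "load first, structure later" flow described above can be imitated in Python as a chain of small steps, loosely mirroring Pig's LOAD, FOREACH, and FILTER operators. This is only an analogy under assumed inputs: the line layout (`user action amount`), the regular expression, and all function names are hypothetical.

```python
import re

# Assumed layout of a well-formed line: "<user> <action> <amount>"
LINE_PATTERN = re.compile(r"(?P<user>\w+)\s+(?P<action>\w+)\s+(?P<amount>\d+)")


def load(lines):
    """LOAD step: hand raw lines downstream without imposing any schema."""
    for line in lines:
        yield line


def parse(records):
    """FOREACH step: turn each parseable line into a structured record, skip the rest."""
    for line in records:
        match = LINE_PATTERN.fullmatch(line.strip())
        if match:
            yield {
                "user": match["user"],
                "action": match["action"],
                "amount": int(match["amount"]),
            }


def keep_action(records, action):
    """FILTER step: keep only records with the requested action."""
    return (record for record in records if record["action"] == action)


raw_lines = [
    "alice buy 30",
    "### corrupted line ###",
    "bob view 0",
    "carol buy 12",
]
structured = list(keep_action(parse(load(raw_lines)), "buy"))
print(structured)
```

Each step consumes the output of the previous one, which is the sense in which Pig is a data flow language: the schema appears partway through the pipeline rather than being required up front.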