Hadoop's Hive/Pig, HDFS and MapReduce relationship

My understanding of Apache Hive is that it's a SQL-like tooling layer for querying Hadoop clusters. My understanding of Apache Pig is that it's a procedural language for querying Hadoop clusters. So, if my understanding is correct, Hive and Pig seem like two different ways of solving the same problem.

My problem, however, is that I don't understand the problem they are both solving in the first place!

Say we have a DB (relational, NoSQL, doesn't matter) that feeds data into HDFS so that a particular MapReduce job can be run against that input data:

[Diagram: a DB feeding data into HDFS, with a MapReduce job running over that input data]

I'm confused as to which system Hive/Pig are querying! Are they querying the database? Are they querying the raw input data stored in the DataNodes on HDFS? Are they running little ad hoc, on-the-fly MR jobs and reporting their results/outputs?

What is the relationship between these query tools, the MR job input data stored on HDFS, and the MR job itself?

Apache Pig and Apache Hive load data from HDFS, unless you run them locally, in which case they load it locally. How do they get the data from a DB? They don't. You need another framework to export the data from your traditional DB into HDFS, such as Sqoop.
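
A minimal sketch of such an export with Sqoop (the JDBC URL, credentials, table name, and target directory below are all placeholders):

# pull the 'users' table from MySQL into HDFS under /data/users
sqoop import \
    --connect jdbc:mysql://dbhost:3306/mydb \
    --username dbuser \
    --table users \
    --target-dir /data/users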

Once you have the data in HDFS, you can start working with Pig and Hive. They never query a DB. In Apache Pig, for example, you could load your data using a Pig loader:

A = LOAD 'path/in/your/HDFS' USING PigStorage('\t');
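
If you know the structure of the file, you can also declare a schema at load time. A sketch, assuming a hypothetical tab-separated file of site visits:

-- load with an explicit schema so fields can be referenced by name
visits = LOAD 'path/in/your/HDFS/visits.tsv' USING PigStorage('\t') AS (user_id:chararray, url:chararray, ts:long);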

As for Hive, you need to create a table and then load the data into the table:

LOAD DATA INPATH 'path/in/your/HDFS/your.csv' INTO TABLE t1;
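
The table itself has to exist before the load. A minimal sketch for a comma-separated file (the column names and types here are assumptions):

-- define the table layout; the metadata goes to the metastore, the data stays in HDFS
CREATE TABLE t1 (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;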

Again, the data must be in HDFS.

As to how they work, it depends. Traditionally they have always worked with the MapReduce execution engine. Both Hive and Pig parse the statements you write in Pig Latin or HiveQL and translate them into an execution plan consisting of a certain number of MapReduce jobs, depending on the plan. However, they can now also translate them for Tez, a new execution engine which is perhaps still too new to work reliably.
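
In Hive, for instance, the execution engine can be switched per session (assuming Tez is installed on the cluster), and newer Pig versions accept a similar execution-mode flag:

SET hive.execution.engine=tez;   -- Hive: or 'mr' for the classic MapReduce engine
pig -x tez your_script.pig       # Pig: run a script on Tez instead of MapReduce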

Why the need for Pig or Hive? Well, you really don't need these frameworks. Everything they can do, you can also do by writing your own MapReduce or Tez jobs. However, writing, for instance, a JOIN operation in MapReduce might take hundreds or thousands of lines of code (really), while it is only a single line of code in Pig or Hive.
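
For example, a join of two datasets on a key is one line of Pig Latin, or one clause of HiveQL (relation, table, and key names here are assumed for illustration):

C = JOIN A BY user_id, B BY user_id;                  -- Pig
SELECT * FROM a JOIN b ON (a.user_id = b.user_id);    -- Hive

Behind that single line, the engine generates all the map-side tagging, shuffling, and reduce-side merging you would otherwise write by hand.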

I don't think you can query any data with Hive/Pig without actually adding it to them first. So first you need to add data. This data can come from any place; you just give the path for the data to be picked up, or add it to them directly. Once the data is in place, the query fetches the data only from those tables.

Underneath, they use MapReduce as the tool to do the processing. If you just have data lying somewhere and need analysis, you can go directly to MapReduce and define your own logic. Hive is mostly at the SQL front: you get querying features similar to SQL, and at the backend, MapReduce does the job (see the small sketch below). Hope this info helps.
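
One way to see this for yourself is Hive's EXPLAIN statement, which prints the execution plan, including the MapReduce stages, for a query; a sketch against the hypothetical t1 table from above:

EXPLAIN SELECT name, COUNT(*) FROM t1 GROUP BY name;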

I don't agree that Pig and Hive solve the same problem. Hive is for querying data stored on HDFS as external or internal tables; Pig is for managing a data flow over data stored on HDFS, expressed as a directed acyclic graph of transformations. These are their main goals, and we don't care about other uses here. I want to make the difference between:

  • Querying data (the main purpose of Hive), which means getting answers to questions about your data, for example: how many distinct users visited my website per month this year (see the HiveQL sketch after this list).
  • Managing a data flow (the main purpose of Pig), which means taking your data from its initial state, through transformations, to a different final state, for example: data in location A, filtered by criteria c, joined with data in location B, and stored in location C (see the Pig sketch after this list).
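
Sketches of both, with all table, relation, and field names assumed for illustration:

-- Hive: distinct users visiting per month in a given year
SELECT month(visit_date) AS m, COUNT(DISTINCT user_id) AS visitors
FROM visits
WHERE year(visit_date) = 2015
GROUP BY month(visit_date);

-- Pig: filter A by a criterion, join with B, store the result in C
A = LOAD '/data/A' USING PigStorage(',') AS (id:chararray, value:int);
B = LOAD '/data/B' USING PigStorage(',') AS (id:chararray, label:chararray);
A_filtered = FILTER A BY value > 10;    -- the "criteria c" of the example
J = JOIN A_filtered BY id, B BY id;
STORE J INTO '/data/C' USING PigStorage(',');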

Smeeb, Pig and Hive do the same thing, I mean processing data that comes in files or whatever format. If you want to process data present in an RDBMS, first get that data into HDFS with the help of Sqoop (SQL + HADOOP).

Hive uses HQL, which is SQL-like, for processing; Pig uses a data-flow style written in Pig Latin. Hive stores all input data in table format, so before loading data into Hive you first create a Hive table; that structure (the metadata) is stored in an RDBMS such as MySQL (the metastore). Then load with: LOAD DATA INPATH 'path/in/your/HDFS/your.csv' INTO TABLE t1;
