
When to use Hadoop, HBase, Hive and Pig?

What are the benefits of using either Hadoop or HBase or Hive?

From my understanding, HBase avoids using map-reduce and provides column-oriented storage on top of HDFS. Hive is an SQL-like interface for Hadoop and HBase.

I would also like to know how Hive compares with Pig.

MapReduce is just a computing framework. HBase has nothing to do with it. That said, you can efficiently put or fetch data to/from HBase by writing MapReduce jobs. Alternatively, you can write sequential programs using the other HBase APIs, such as the Java client, to put or fetch data. But we use Hadoop, HBase etc. to deal with gigantic amounts of data, so that often doesn't make much sense: using normal sequential programs would be highly inefficient when your data is too huge.
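For illustration, a minimal sketch of such a sequential program using the HBase Java client API (HBase 1.x-style connection API). The table name `users`, the column family `info` and the values are invented for this example, and the table is assumed to already exist:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSequentialExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Put one cell: row key "user123", column family "info", qualifier "email"
            Put put = new Put(Bytes.toBytes("user123"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
                          Bytes.toBytes("someone@example.com"));
            table.put(put);

            // Fetch it back by row key
            Get get = new Get(Bytes.toBytes("user123"));
            Result result = table.get(get);
            byte[] email = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
            System.out.println(Bytes.toString(email));
        }
    }
}
```

Each such put/get is a single keyed round trip; for moving bulk data in or out, the MapReduce integration (e.g. `TableMapper`/`TableReducer` in `org.apache.hadoop.hbase.mapreduce`) is the efficient path.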

Coming back to the first part of your question: Hadoop is basically two things, a distributed file system (HDFS) plus a computation or processing framework (MapReduce). Like all other file systems, HDFS provides us storage, but in a fault-tolerant manner with high throughput and a lower risk of data loss (because of replication). But, being a file system, HDFS lacks random read and write access. This is where HBase comes into the picture. It's a distributed, scalable, big data store, modelled after Google's BigTable. It stores data as key/value pairs.

Coming to Hive: it provides us data warehousing facilities on top of an existing Hadoop cluster. Along with that, it provides an SQL-like interface which makes your work easier in case you are coming from an SQL background. You can create tables in Hive and store data there. You can even map your existing HBase tables to Hive and operate on them.
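As a hedged sketch of that SQL-like interface, the snippet below talks to HiveServer2 over JDBC; the endpoint, table name and columns are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Explicit driver load for older JDBC setups; JDBC 4 auto-registers it
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = con.createStatement()) {

            stmt.execute("CREATE TABLE IF NOT EXISTS page_views "
                       + "(user_id STRING, url STRING, ts BIGINT)");

            // An aggregation like this gets compiled into MapReduce jobs under the hood
            ResultSet rs = stmt.executeQuery(
                    "SELECT url, COUNT(*) AS hits FROM page_views GROUP BY url");
            while (rs.next()) {
                System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```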

Pig, meanwhile, is basically a dataflow language that allows us to process enormous amounts of data very easily and quickly. Pig has two parts: the Pig interpreter and the language, Pig Latin. You write Pig scripts in Pig Latin and process them using the Pig interpreter. Pig makes our life a lot easier; otherwise, writing MapReduce by hand is never easy, and in some cases it can really become a pain.
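Pig scripts are usually run from the `pig` command-line interpreter, but Pig Latin can also be embedded in Java via `PigServer`. A minimal sketch, where the input file, field layout and output path are invented for illustration:

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigEmbeddedExample {
    public static void main(String[] args) throws Exception {
        // Local mode for a quick test; use ExecType.MAPREDUCE against a cluster
        PigServer pig = new PigServer(ExecType.LOCAL);

        // A tiny Pig Latin dataflow: load logs, group by URL, count hits
        pig.registerQuery("logs = LOAD 'access_log' USING PigStorage(' ') "
                        + "AS (ip:chararray, url:chararray);");
        pig.registerQuery("by_url = GROUP logs BY url;");
        pig.registerQuery("hits = FOREACH by_url GENERATE group AS url, COUNT(logs) AS n;");

        // store() triggers execution of the whole dataflow and writes the result
        pig.store("hits", "url_hits");
    }
}
```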

I wrote an article with a short comparison of the different tools of the Hadoop ecosystem some time ago. It's not an in-depth comparison, but a short intro to each of these tools which can help you get started. (Just to add on to my answer; no self-promotion intended.)

Both Hive and Pig queries get converted into MapReduce jobs under the hood.

HTH

I implemented a Hive data platform recently at my firm and can speak to it in the first person since I was a one-man team.

Objective

  1. To have the daily web log files collected from 350+ servers queryable through some SQL-like language
  2. To replace the daily aggregation data generated through MySQL with Hive
  3. To build custom reports through queries in Hive

Architecture Options

I benchmarked the following options:

  1. Hive+HDFS
  2. Hive+HBase - queries were too slow, so I dropped this option

Design

  1. Daily log files were transported to HDFS
  2. MR jobs parsed these log files and wrote output files to HDFS
  3. Create Hive tables with partitions and locations pointing to the HDFS locations (a sketch follows this list)
  4. Create Hive query scripts (call it HQL if you like, as distinct from SQL) that in turn ran MR jobs in the background and generated aggregation data
  5. Put all these steps into an Oozie workflow - scheduled with a daily Oozie coordinator
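As an illustration of step 3 above, the DDL for one of those partitioned tables might look roughly like the sketch below, submitted over HiveServer2 JDBC. The table name, columns and HDFS paths are invented:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateLogTable {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = con.createStatement()) {

            // External table: Hive owns the schema, HDFS keeps the files in place
            stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS web_logs "
                       + "(ip STRING, url STRING, status INT) "
                       + "PARTITIONED BY (log_date STRING) "
                       + "LOCATION '/data/parsed_logs'");

            // Register one day's MR output (step 2) as a partition
            stmt.execute("ALTER TABLE web_logs "
                       + "ADD IF NOT EXISTS PARTITION (log_date='2013-01-01') "
                       + "LOCATION '/data/parsed_logs/2013-01-01'");
        }
    }
}
```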

Summary

HBase is like a map: if you know the key, you can instantly get the value. But if you want to know, say, how many integer keys in HBase are between 1000000 and 2000000, that is not suitable for HBase alone.
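To make that concrete, here is a hedged Java sketch contrasting the point lookup HBase is built for with the range question above, which forces a scan. It assumes an HBase 2.x client (`withStartRow`/`withStopRow`; older clients use `setStartRow`/`setStopRow`), an existing table named `counters`, and zero-padded row keys so that lexicographic order matches numeric order:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PointGetVsScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("counters"))) {

            // Point lookup by key: what HBase is designed for
            Result one = table.get(new Get(Bytes.toBytes("0001500000")));
            System.out.println("found: " + !one.isEmpty());

            // Range question: HBase can only answer it by scanning the key range
            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes("0001000000"))
                    .withStopRow(Bytes.toBytes("0002000000"));
            long count = 0;
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    count++;
                }
            }
            System.out.println("rows in range: " + count);
        }
    }
}
```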

If you have data that needs to be aggregated, rolled up, or analyzed across rows, then consider Hive.

Hopefully this helps.

Hive actually rocks... I know, I have lived with it for 12 months now... So does HBase...

Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

There are four main modules in Hadoop.

  1. Hadoop Common: The common utilities that support the other Hadoop modules.

  2. Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.

  3. Hadoop YARN: A framework for job scheduling and cluster resource management.

  4. Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

Before going further, let's note that there are three different types of data.

  • Structured: Structured data has a strong schema, and the schema is checked during write and read operations, e.g. data in RDBMS systems like Oracle, MySQL Server, etc.

  • Unstructured: Data does not have any structure and can be in any form - web server logs, e-mail, images, etc.

  • Semi-structured: Data is not strictly structured but has some structure, e.g. XML files.

Depending on the type of data to be processed, we have to choose the right technology.

Some more projects, which are part of Hadoop:

  • HBase™: A scalable, distributed database that supports structured data storage for large tables.

  • Hive™: A data warehouse infrastructure that provides data summarization and ad-hoc querying.

  • Pig™: A high-level data-flow language and execution framework for parallel computation.

A Hive vs. Pig comparison can be found in this article and my other post at this SE question.

HBase won't replace MapReduce. HBase is a scalable distributed database, and MapReduce is a programming model for distributed processing of data. MapReduce may act on data in HBase during processing.

You can use HIVE/HBASE for structured/semi-structured data and process it with Hadoop MapReduce.

You can use SQOOP to import structured data from a traditional RDBMS database such as Oracle or SQL Server and process it with Hadoop MapReduce.

You can use FLUME to ingest unstructured data, and then process it with Hadoop MapReduce.

Have a look at: Hadoop Use Cases.

Hive should be used for analytical querying of data collected over a period of time, e.g. to calculate trends or summarize website logs, but it can't be used for real-time queries.

HBase fits real-time querying of big data. Facebook uses it for messaging and real-time analytics.

PIG can be used to construct dataflows, run scheduled jobs, crunch big volumes of data, aggregate/summarize it and store it in relational database systems. Good for ad-hoc analysis.

Hive can be used for ad-hoc data analysis, but unlike PIG it can't support all unstructured data formats.

Consider that you work with an RDBMS and have to select what to use - full table scans or index access - but only one of them.
If you select full table scans, use Hive. If index access, HBase.

Understanding in depth

Hadoop

Hadoop is an open source project of the Apache foundation. It is a framework written in Java, originally developed by Doug Cutting in 2005. It was created to support distribution for Nutch, the text search engine. Hadoop uses Google's MapReduce and Google File System technologies as its foundation.

Features of Hadoop

  1. It is optimized to handle massive quantities of structured, semi-structured and unstructured data using commodity hardware.
  2. It has a shared-nothing architecture.
  3. It replicates its data onto multiple computers so that if one goes down, the data can still be processed from another machine that stores its replica.
  4. Hadoop is for high throughput rather than low latency. It is a batch operation handling massive quantities of data; therefore the response time is not immediate.
  5. It complements Online Transaction Processing and Online Analytical Processing. However, it is not a replacement for an RDBMS.
  6. It is not good when work cannot be parallelized or when there are dependencies within the data.
  7. It is not good for processing small files. It works best with huge data files and data sets.

Versions of Hadoop

There are two versions of Hadoop available:

  1. Hadoop 1.0
  2. Hadoop 2.0

Hadoop 1.0

It has two main parts:

1. Data Storage Framework

It is a general-purpose file system called the Hadoop Distributed File System (HDFS).

HDFS is schema-less.

It simply stores data files, and these data files can be in just about any format.

The idea is to store files as close to their original form as possible.

This in turn provides the business units and the organization the much-needed flexibility and agility without being overly worried about what it can implement.

2. Data Processing Framework

This is a simple functional programming model initially popularized by Google as MapReduce.

It essentially uses two functions, MAP and REDUCE, to process data.

The "Mappers" take in a set of key-value pairs and generate intermediate data (which is another list of key-value pairs).

The "Reducers" then act on this input to produce the output data.

The two functions seemingly work in isolation from one another, thus enabling the processing to be highly distributed in a highly parallel, fault-tolerant and scalable way.
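The canonical illustration of those two functions is word count. Below is a self-contained sketch against the standard `org.apache.hadoop.mapreduce` API, with input and output paths taken from the command line:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: (offset, line) -> (word, 1) intermediate pairs
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: (word, [1, 1, ...]) -> (word, total count)
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```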

Limitations of Hadoop 1.0

  1. The first limitation was the requirement of MapReduce programming expertise.

  2. It supported only batch processing, which, although suitable for tasks such as log analysis and large-scale data mining projects, is pretty much unsuitable for other kinds of projects.

  3. One major limitation was that Hadoop 1.0 was tightly computationally coupled with MapReduce, which meant that the established data management vendors were left with two options:

    1. Either rewrite their functionality in MapReduce so that it could be executed in Hadoop, or

    2. Extract data from HDFS or process it outside of Hadoop.

Neither option was viable, as both led to process inefficiencies caused by data being moved in and out of the Hadoop cluster.

Hadoop 2.0

In Hadoop 2.0, HDFS continues to be the data storage framework.

However, a new and separate resource management framework called Yet Another Resource Negotiator (YARN) has been added.

Any application capable of dividing itself into parallel tasks is supported by YARN.

YARN coordinates the allocation of subtasks of the submitted application, thereby further enhancing the flexibility, scalability and efficiency of applications.

It works by having an ApplicationMaster in place of the JobTracker, running applications on resources governed by the new NodeManager.

The ApplicationMaster is able to run any application, not just MapReduce.

This means it supports not only batch processing but also real-time processing. MapReduce is no longer the only data processing option.

Advantages of Hadoop

It stores data in its native form. There is no structure imposed while keying in or storing data. HDFS is schema-less. It is only later, when the data needs to be processed, that structure is imposed on the raw data.

It is scalable. Hadoop can store and distribute very large datasets across hundreds of inexpensive servers that operate in parallel.

It is resilient to failure. Hadoop is fault-tolerant. It practices replication of data diligently, which means whenever data is sent to any node, the same data also gets replicated to other nodes in the cluster, thereby ensuring that in the event of a node failure there will always be another copy of the data available for use.

It is flexible. One of the key advantages of Hadoop is that it can work with any kind of data: structured, unstructured or semi-structured. Also, processing is extremely fast in Hadoop owing to the "move code to data" paradigm.

Hadoop Ecosystem

The following are the components of the Hadoop ecosystem:

HDFS: Hadoop Distributed File System. It simply stores data files as close to the original form as possible.

HBase: It is Hadoop's database and compares well with an RDBMS. It supports structured data storage for large tables.

Hive: It enables analysis of large datasets using a language very similar to standard ANSI SQL, which implies that anyone familiar with SQL should be able to access data on a Hadoop cluster.

Pig: It is an easy-to-understand data flow language. It helps with analysis of large datasets, which is quite the order of the day with Hadoop. Pig scripts are automatically converted to MapReduce jobs by the Pig interpreter.

ZooKeeper: It is a coordination service for distributed applications.

Oozie: It is a workflow scheduler system to manage Apache Hadoop jobs.

Mahout: It is a scalable machine learning and data mining library.

Chukwa: It is a data collection system for managing large distributed systems.

Sqoop: It is used to transfer bulk data between Hadoop and structured data stores such as relational databases.

Ambari: It is a web-based tool for provisioning, managing and monitoring Hadoop clusters.

Hive

Hive is a data warehouse infrastructure tool for processing structured data in Hadoop. It resides on top of Hadoop to summarize big data, and it makes querying and analyzing easy.

Hive is not:

  1. A relational database

  2. A design for Online Transaction Processing (OLTP)

  3. A language for real-time queries and row-level updates

Features of Hive

  1. It stores the schema in a database and the processed data in HDFS.

  2. It is designed for OLAP.

  3. It provides an SQL-type language for querying, called HiveQL or HQL.

  4. It is familiar, fast, scalable and extensible.

Hive Architecture

The following components are contained in the Hive architecture:

  1. User Interface: Hive is a data warehouse infrastructure that creates interaction between the user and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command line and Hive HD Insight (on Windows Server).

  2. MetaStore: Hive chooses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types and the HDFS mapping.

  3. HiveQL Process Engine: HiveQL is similar to SQL for querying schema info in the Metastore. It is one of the replacements for the traditional approach of writing a MapReduce program. Instead of writing MapReduce in Java, we can write a query for MapReduce and process it.

  4. Execution Engine: The conjunction of the HiveQL process engine and MapReduce is the Hive execution engine. The execution engine processes the query and generates results the same as MapReduce results. It uses the flavor of MapReduce.

  5. HDFS or HBase: The Hadoop Distributed File System or HBase are the data storage techniques used to store data in the file system.

For a comparison between Hadoop and Cassandra/HBase, read this post.

Basically, HBase enables really fast reads and writes with scalability. How fast and scalable? Facebook uses it to manage its user statuses, photos, chat messages etc. HBase is so fast that Facebook even developed stacks that use HBase as the data store for Hive itself.

Hive, on the other hand, is more like a data warehousing solution. You can use a syntax similar to SQL to query Hive contents, which results in a MapReduce job. Not ideal for fast, transactional systems.

I worked on a Lambda architecture processing real-time and batch loads. Real-time processing is needed where fast decisions must be taken, e.g. on a fire alarm sent by a sensor, or for fraud detection on banking transactions. Batch processing is needed to summarize data which can be fed into BI systems.

We used Hadoop ecosystem technologies for the above applications.

Real-Time Processing

Apache Storm: stream data processing, rule application

HBase: datastore for serving the real-time dashboard

Batch Processing (Hadoop): crunching huge chunks of data, building 360-degree overviews, or adding context to events. Interfaces or frameworks like Pig, MR, Spark, Hive and Shark help with the computing. This layer needs a scheduler, for which Oozie is a good option.

Event Handling Layer

Apache Kafka was the first layer, consuming high-velocity events from the sensors. Kafka serves both the real-time and batch analytics data flows through LinkedIn connectors.

First of all, we should be clear that Hadoop was created as a faster alternative to an RDBMS: to process large amounts of data at a very fast rate, where an RDBMS would earlier take a lot of time.

Now one should know two terms:

  1. Structured Data: This is the data that we used in traditional RDBMSs, and it is divided into well-defined structures.

  2. Unstructured Data: This is important to understand; about 80% of the world's data is unstructured or semi-structured. This is data in its raw form that cannot be processed using an RDBMS. Example: Facebook and Twitter data. (http://www.dummies.com/how-to/content/unstructured-data-in-a-big-data-environment.html)

So, a large amount of data was being generated in the last few years, and the data was mostly unstructured; that gave birth to HADOOP. It was mainly used for very large amounts of data that would take an unfeasible amount of time with an RDBMS. It had many drawbacks, e.g. it could not be used for comparatively small data in real time, but they have managed to remove those drawbacks in the newer versions.

Before going further, I would like to point out that a new big data tool is usually created when a fault is seen in the previous tools. So, whichever tool you see was created to overcome a problem with the previous tools.

Hadoop can be simply described as two things: MapReduce and HDFS. MapReduce is where the processing takes place, and HDFS is where the data is stored. This structure follows the WORM principle, i.e. write once, read multiple times. So, once we have stored data in HDFS, we cannot make changes. This led to the creation of HBASE, a NoSQL product where we can make changes in the data even after writing it once.

But with time we saw that Hadoop had many shortcomings, and to address them we created different environments over the Hadoop structure. PIG and HIVE are two popular examples.

HIVE was created for people with an SQL background. The queries written are similar to SQL and are named HIVEQL. HIVE was developed to process completely structured data; it is not used for unstructured data.

PIG, on the other hand, has its own query language, PIG LATIN. It can be used for both structured as well as unstructured data.

As to the difference - when to use HIVE and when to use PIG - I don't think anyone other than the architects of PIG could say. Follow the link: https://developer.yahoo.com/blogs/hadoop/comparing-pig-latin-sql-constructing-data-processing-pipelines-444.html

Hadoop:

Hadoop provides HDFS, the Hadoop Distributed File System, together with the Map-Reduce computational processing model.

HBase:

HBase is key-value storage, good for reading and writing in near real time.

Hive:

Hive is used for data extraction from HDFS using SQL-like syntax. Hive uses the HQL language.

Pig:

Pig is a data flow language for creating ETL. It's a scripting language.

Let me try to answer in a few words.

Hadoop is an ecosystem which comprises all the other tools. So you can't compare Hadoop, but you can compare MapReduce.

Here are my few cents:

  1. Hive: If your need is very SQL-ish, meaning your problem statement can be catered for by SQL, then the easiest thing to do would be to use Hive. The other case when you would use Hive is when you want a server to have a certain structure of data.
  2. Pig: If you are comfortable with Pig Latin, what you need is more of a data pipeline, and your data lacks structure, then you could use Pig. Honestly, there is not much difference between Hive and Pig with respect to the use cases.
  3. MapReduce: If your problem cannot be solved by using SQL directly, you should first try to create a UDF for Hive and Pig; then, if the UDF does not solve the problem, getting it done via MapReduce makes sense (see the sketch after this list).
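To make item 3 concrete, a Hive UDF can be as small as the sketch below. It uses the classic `org.apache.hadoop.hive.ql.exec.UDF` base class (newer Hive versions favor `GenericUDF`); the class name and function are invented for illustration:

```java
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// A trivial UDF that trims whitespace and lower-cases a string
public final class NormalizeString extends UDF {
    public Text evaluate(Text input) {
        if (input == null) {
            return null;
        }
        return new Text(input.toString().trim().toLowerCase());
    }
}
```

Packaged into a jar, it would be wired up with something like `ADD JAR /path/to/udfs.jar;` and `CREATE TEMPORARY FUNCTION normalize AS 'NormalizeString';`, after which `normalize(col)` can be used in any HiveQL statement.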

Pig: best for handling files and cleaning data, for example removing null values, string handling, and dropping unnecessary values. Hive: for querying the cleaned data.

1. We are using Hadoop for storing large data (i.e. structured, unstructured and semi-structured data) in file formats like txt and csv.

2. If we want columnar updates in our data, then we use the HBase tool.

3. In the case of Hive, we store big data that is in a structured format, and in addition to that we provide analysis of that data.

4. Pig is a tool that uses the Pig Latin language to analyze data in any format (structured, semi-structured and unstructured).

Cleaning data in Pig is very simple; a suitable approach is to clean the data through Pig, then process it through Hive, and then upload it to HDFS.

Use of Hive, HBase and Pig with respect to my real-time experience in different projects.

Hive is used mostly for:

  • Analytics purposes, where you need to do analysis on historical data

  • Generating business reports based on certain columns

  • Efficiently managing the data together with metadata information

  • Joining tables on certain frequently used columns by using the bucketing concept

  • Efficient storing and querying using the partitioning concept

  • Not useful for transaction/row-level operations like update, delete, etc.

Pig is mostly used for:

  • Frequent data analysis on huge data

  • Generating aggregated values/counts on huge data

  • Generating enterprise-level key performance indicators very frequently

HBase is mostly used:

  • For real-time processing of data

  • For efficiently managing complex and nested schemas

  • For real-time querying and faster results

  • For easy scalability with columns

  • Useful for transaction/row-level operations like update, delete, etc.

The short answer to this question is:

Hadoop - a framework which provides a distributed file system and a programming model, allowing us to store humongous amounts of data and process it in a distributed fashion very efficiently, with much less processing time compared to traditional approaches.

(HDFS - Hadoop Distributed File System) (MapReduce - programming model for distributed processing)

Hive - a query language which allows reading/writing data from the Hadoop distributed file system in the very popular SQL-like fashion. This made life easier for many non-programming-background people, as they don't have to write MapReduce programs anymore, except for very complex scenarios where Hive is not supported.

HBase - a columnar NoSQL database. The underlying storage layer for HBase is again HDFS. The most important use case for this database is being able to store billions of rows with millions of columns. HBase's low latency helps with faster, random access of records over distributed data, a very important feature making it useful for complex projects like recommender engines. Also, its record-level versioning capability allows users to store transactional data very efficiently (this solves the problem of updating records that we have with HDFS and Hive).
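A hedged sketch of that record-level versioning with the HBase 2.x Java client (older clients use `Get.setMaxVersions` instead of `readVersions`). The table `accounts` and its columns are invented, and the column family is assumed to have been created with `VERSIONS` greater than 1 so that multiple versions are retained:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class VersioningExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("accounts"))) {
            byte[] row = Bytes.toBytes("acct-42");
            byte[] cf = Bytes.toBytes("d");
            byte[] col = Bytes.toBytes("balance");

            // Two writes to the same cell with explicit timestamps:
            // HBase keeps both as separate versions instead of overwriting
            table.put(new Put(row).addColumn(cf, col, 1L, Bytes.toBytes("100")));
            table.put(new Put(row).addColumn(cf, col, 2L, Bytes.toBytes("250")));

            // Read back up to 3 versions of the cell, newest first
            Result result = table.get(new Get(row).readVersions(3));
            for (Cell cell : result.getColumnCells(cf, col)) {
                System.out.println(cell.getTimestamp() + " -> "
                        + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
}
```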

Hope this is helpful for quickly understanding the above three features.

I believe this thread hasn't done justice to HBase and Pig in particular. While I believe Hadoop is the choice of distributed, resilient file system for big-data-lake implementations, the choice between HBase and Hive in particular is well-segregated.

That is, a lot of use-cases have a particular requirement for SQL-like or NoSQL-like interfaces. With Phoenix on top of HBase, SQL-like capability is certainly achievable; however, the performance, third-party integrations and dashboard updates are kind of painful experiences. Nevertheless, it's an excellent choice for databases requiring horizontal scaling.

Pig is particularly excellent for non-recursive, batch-like computations or ETL pipelining (somewhere it outperforms Spark by a comfortable distance). Also, its high-level dataflow implementation is an excellent choice for batch querying and scripting. The choice between Pig and Hive also pivots on the need for client- or server-side scripting, required file formats, etc. Pig supports the Avro file format, which is not true in the case of Hive. The choice between a 'procedural dataflow language' and a 'declarative dataflow language' is also a strong argument in the choice between Pig and Hive.

Pig is mostly dead after Cloudera got rid of it in CDP. The last release on Apache was 19 June 2017 (release 0.17.0), so basically no committers are actively working on it anymore. Spark and Python are way more powerful than Pig.
