
Where is data stored in Hadoop?

Though I have understood the architecture of Hadoop a bit, I still have some gaps in my understanding of where the data is actually located.

My question is: suppose I have a large dataset of some random books. Is the books' data stored across multiple nodes in HDFS beforehand, and then we run MapReduce on each node and collect the result on our system?

OR

Or do we store the data somewhere in a large database, and whenever we want to run a MapReduce operation, take the chunks and distribute them across multiple nodes to perform the operation?

Either is possible; it really depends on your use case and needs. However, Hadoop MapReduce generally runs against data stored in HDFS. The system is designed around data locality, which requires the data to be in HDFS: map tasks run on the same piece of hardware where the data is stored in order to improve performance.
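
As a minimal sketch of that usual case (not from the original answer; the HDFS paths are placeholders and the mapper is a stock word-count-style tokenizer), the job reads its input straight from HDFS, and the scheduler tries to place each map task on a DataNode holding a replica of the split it reads:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class HdfsJobSketch {

    // Trivial mapper: emits (word, 1) for every token in a line of the input books.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "books-analysis");
        job.setJarByClass(HdfsJobSketch.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // The input already lives in HDFS; the scheduler tries to run each map
        // task on a node that holds a replica of the split it reads (data locality).
        FileInputFormat.addInputPath(job, new Path("hdfs:///data/books"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs:///output/books-wordcount"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```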

That said, if for some reason your data must be stored outside of HDFS and then processed using MapReduce, it can be done, but it is a bit more work and not as efficient as processing data that is local in HDFS.

So let's take two use cases, starting with log files. Log files as they are aren't particularly accessible; they just need to be dumped somewhere and kept for later analysis. HDFS is perfect for this. If you really need a log back out you can get it, but generally people will be looking for the output of the analytics. So store your logs in HDFS and process them normally.
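
Landing the raw logs in HDFS can be as simple as an `hdfs dfs -put`, or done programmatically with the FileSystem API. The sketch below is an illustration only; the local and HDFS paths are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LogIngestSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Picks up fs.defaultFS from core-site.xml, e.g. hdfs://namenode:8020
        FileSystem fs = FileSystem.get(conf);

        // Copy a local log file into an HDFS directory for later batch analysis.
        fs.copyFromLocalFile(new Path("/var/log/app/access.log"),
                             new Path("/logs/raw/access.log"));
        fs.close();
    }
}
```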

However, data in the format that is ideal for HDFS and Hadoop MapReduce (many records in a single large flat file) is not what I would consider highly accessible. Hadoop MapReduce expects input files that are many megabytes in size, with many records per file. The further you diverge from this, the more your performance will decline. Sometimes your data needs to be online at all times, and HDFS is not ideal for that. Take your book example: if the books are used in an application that needs the content accessible online, i.e. for editing and annotating, you may choose to store them in a database. Then, when you need to run batch analytics, you use a custom InputFormat to retrieve the records from the database and process them in MapReduce.
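
The answer doesn't show that custom InputFormat, but Hadoop ships a generic one for this case: DBInputFormat, which reads rows over JDBC and hands them to the mappers. The sketch below is an illustration under assumptions; the `books` table, its columns, the JDBC URL, and the `BookRecord` class are all hypothetical:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

public class DbInputSketch {

    // One row of the hypothetical "books" table, readable by DBInputFormat.
    public static class BookRecord implements Writable, DBWritable {
        private long id;
        private String text;

        @Override public void readFields(ResultSet rs) throws SQLException {
            id = rs.getLong("id");
            text = rs.getString("text");
        }
        @Override public void write(PreparedStatement ps) throws SQLException {
            ps.setLong(1, id);
            ps.setString(2, text);
        }
        @Override public void readFields(DataInput in) throws IOException {
            id = in.readLong();
            text = in.readUTF();
        }
        @Override public void write(DataOutput out) throws IOException {
            out.writeLong(id);
            out.writeUTF(text);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // JDBC driver, URL, and credentials are placeholders.
        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
                "jdbc:mysql://dbhost/library", "user", "password");

        Job job = Job.getInstance(conf, "books-from-db");
        job.setJarByClass(DbInputSketch.class);
        job.setInputFormatClass(DBInputFormat.class);

        // Pull the rows to process; mapper/reducer/output setup mirrors the
        // HDFS sketch above and is omitted here for brevity.
        DBInputFormat.setInput(job, BookRecord.class,
                "books",          // table
                null,             // WHERE conditions
                "id",             // ORDER BY column (used for splitting)
                "id", "text");    // columns to read
    }
}
```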

I am currently doing this with a web crawler that stores web pages individually in Amazon S3. Web pages are too small to serve as single efficient inputs to MapReduce, so I have a custom InputFormat that feeds each mapper several files. The output of the MapReduce job is eventually written back to S3, and because I am using Amazon EMR, the Hadoop cluster goes away afterwards.
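
The author's custom InputFormat isn't shown, but a common built-in way to get a similar many-small-files-per-mapper effect is CombineTextInputFormat, which packs multiple small files into each split. This is only a sketch of that alternative; the bucket names and split size are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallFilesSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "crawl-pages");
        job.setJarByClass(SmallFilesSketch.class);

        // Pack many small page files into splits of up to ~128 MB each,
        // so one mapper handles many files instead of one file apiece.
        job.setInputFormatClass(CombineTextInputFormat.class);
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

        // On EMR the s3:// scheme reads from and writes to S3 directly.
        // Mapper/reducer setup is omitted; the defaults pass records through.
        FileInputFormat.addInputPath(job, new Path("s3://my-crawl-bucket/pages/"));
        FileOutputFormat.setOutputPath(job, new Path("s3://my-crawl-bucket/output/"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```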

