
Hadoop and JGit: converting between java.io.File and DataOutputStream

Hello, I'm trying to run MapReduce jobs on Git repositories. I wanted to use a map task to first clone all of the repositories to HDFS concurrently, and then run further MapReduce jobs on the files. I've run into a problem: I'm not sure how to write the repository files to HDFS. I have seen examples that write individual files, but those were outside the mapper and only wrote single files. The JGit API only exposes a FileRepository structure that inherits from File, but HDFS uses Paths written as DataOutputStreams. Is there a good way to convert between the two, or any examples that do something similar?

Thanks
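For illustration only (not from the original post), a minimal sketch of one way to bridge the two APIs: let JGit clone into a local java.io.File directory, then stream each file into HDFS through FileSystem.create(), which returns an FSDataOutputStream. The repository URI and all paths below are placeholders.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.eclipse.jgit.api.Git;
import org.eclipse.jgit.api.errors.GitAPIException;

public class CloneToHdfs {
    public static void main(String[] args) throws IOException, GitAPIException {
        // JGit works against the local filesystem (java.io.File), so clone locally first.
        File localDir = new File("/tmp/repo-clone");              // hypothetical path
        Git.cloneRepository()
           .setURI("https://example.com/some/repo.git")           // hypothetical URI
           .setDirectory(localDir)
           .call()
           .close();

        // Then push the cloned tree into HDFS through the Hadoop FileSystem API.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        copyTree(fs, localDir, new Path("/user/hadoop/repos/repo-clone"));
    }

    // Recursively copy a local directory into HDFS, one file at a time.
    static void copyTree(FileSystem fs, File src, Path dst) throws IOException {
        if (src.isDirectory()) {
            fs.mkdirs(dst);
            File[] children = src.listFiles();
            if (children == null) {
                return;
            }
            for (File child : children) {
                copyTree(fs, child, new Path(dst, child.getName()));
            }
        } else {
            // fs.create() gives an FSDataOutputStream; copy the local bytes into it.
            try (FileInputStream in = new FileInputStream(src);
                 FSDataOutputStream out = fs.create(dst)) {
                IOUtils.copyBytes(in, out, 4096, false);
            }
        }
    }
}
```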

The input data for a Hadoop mapper must be on HDFS, not on your local machine or anywhere other than HDFS. MapReduce jobs are not meant for migrating data from one place to another; they are used to process huge volumes of data already present on HDFS. I am sure your repository data is not on HDFS, and if it were, you would not need to perform any operation in the first place. So please keep in mind that MapReduce jobs are for processing large volumes of data that is already stored on HDFS (the Hadoop file system).
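As a sketch of that advice (paths are hypothetical, not from the answer): stage the locally cloned repository onto HDFS up front, then point the MapReduce job at the staged path. The same staging step can also be done from the shell with `hdfs dfs -put /tmp/repo-clone /user/hadoop/input/`.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StageRepoOnHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // copyFromLocalFile copies directories recursively, so a whole cloned
        // repository can be staged in one call before the job is submitted.
        fs.copyFromLocalFile(new Path("/tmp/repo-clone"),
                             new Path("/user/hadoop/input/repo-clone"));
    }
}
```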
