
Hadoop and JGit: converting between java.io.File and DataOutputStream

Hello, I'm trying to run MapReduce jobs on git repositories. I wanted to use a map job to first concurrently clone all repositories to HDFS, then run further MapReduce jobs on the files. I'm running into a problem in that I'm not sure how to write the repository files to HDFS. I have seen examples that write individual files, but those were outside the mapper and only wrote single files. The JGit API only exposes a FileRepository structure built around java.io.File, but HDFS uses Paths written through DataOutputStreams. Is there a good way to convert between the two, or any examples that do something similar?

Thanks
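
For anyone hitting the same File-versus-stream mismatch: HDFS never hands out a java.io.File, but FileSystem.create() returns an FSDataOutputStream you can copy any local file's bytes into. Below is a minimal sketch of that conversion, assuming a cluster reachable through the fs.defaultFS in your core-site.xml; the class name and both paths are placeholders for illustration.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class LocalFileToHdfs {
    public static void main(String[] args) throws Exception {
        // Hypothetical source and destination paths.
        File localFile = new File("/tmp/staging/somefile");
        Path hdfsTarget = new Path("/user/hadoop/repos/somefile");

        // Picks up fs.defaultFS from the Hadoop config on the classpath.
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf);
             InputStream in = new FileInputStream(localFile);
             FSDataOutputStream out = fs.create(hdfsTarget, true /* overwrite */)) {
            // Stream the local bytes into the HDFS file.
            IOUtils.copyBytes(in, out, 4096, false);
        }
    }
}
```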

The input data to a Hadoop mapper must be on HDFS, not on your local machine or anywhere other than HDFS. MapReduce jobs are not meant for migrating data from one place to another; they are used to process huge volumes of data already present on HDFS. I am sure your repository data is not on HDFS, and if it were, you wouldn't have needed to perform any copy operation in the first place. So please keep in mind that MapReduce jobs are for processing large volumes of data that are already on HDFS (the Hadoop Distributed File System).
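
In practice that means cloning with JGit to local disk first (for example on an edge node, or in a plain driver program rather than inside a mapper), then copying the clone up with FileSystem.copyFromLocalFile, which copies directories recursively. A sketch under those assumptions; the repository URL and target paths are placeholders:

```java
import java.io.File;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.eclipse.jgit.api.Git;

public class CloneRepoToHdfs {
    public static void main(String[] args) throws Exception {
        // Placeholder repository URL and local staging directory.
        String repoUrl = "https://example.com/project.git";
        File localClone = new File("/tmp/staging/project");

        // Clone to local disk with JGit; its FileRepository only works on local files.
        try (Git git = Git.cloneRepository()
                .setURI(repoUrl)
                .setDirectory(localClone)
                .call()) {
            // Recursively copy the finished clone into HDFS.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            fs.copyFromLocalFile(new Path(localClone.getAbsolutePath()),
                                 new Path("/user/hadoop/repos/project"));
        }
    }
}
```

The command-line equivalent of the copy step is `hadoop fs -put /tmp/staging/project /user/hadoop/repos/`, which may be simpler if the cloning is scripted anyway.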
