I have a project where the requirement is to download files, in a distributed manner, from external sources. We already have a large investment in Hadoop and are looking to leverage MapReduce -- but more as a distributed task runner than for ETL.
1) Has anyone done this before?
2) Should there be just a Mapper without a Reducer?
3) What's the best way to pass an abstract implementation of an FTP/HTTP connection to the Mapper? -- To be clear, I want a good way to unit test this without running an integration test, so I need a way to mock FTP/HTTP.
4) Is MapReduce the best method for this type of thing? -- Or are we abusing MapReduce?
Thank you.
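To make question 3 concrete, here is the kind of abstraction I have in mind -- a minimal sketch, with hypothetical names (`Fetcher`, `HttpFetcher`, `FakeFetcher`, `DownloadTask` are all illustrations, not an existing API). The Mapper would delegate to the interface, so unit tests can swap in an in-memory fake instead of opening real sockets:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

/** Abstraction over the transport, so the Mapper never opens a real connection in tests. */
interface Fetcher {
    byte[] fetch(String url) throws IOException;
}

/** Production implementation using plain HttpURLConnection; an FTP variant would mirror it. */
class HttpFetcher implements Fetcher {
    public byte[] fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        try (InputStream in = conn.getInputStream()) {
            return in.readAllBytes();
        } finally {
            conn.disconnect();
        }
    }
}

/** In-memory fake for unit tests: serves canned bytes keyed by URL. */
class FakeFetcher implements Fetcher {
    private final Map<String, byte[]> canned = new HashMap<>();
    FakeFetcher put(String url, byte[] body) { canned.put(url, body); return this; }
    public byte[] fetch(String url) throws IOException {
        byte[] body = canned.get(url);
        if (body == null) throw new IOException("no canned response for: " + url);
        return body;
    }
}

/** The download logic the Mapper delegates to; it only ever sees the interface. */
class DownloadTask {
    private final Fetcher fetcher;
    DownloadTask(Fetcher fetcher) { this.fetcher = fetcher; }
    /** Downloads one URL and returns the byte count, e.g. to emit as (url, byteCount). */
    int download(String url) throws IOException {
        return fetcher.fetch(url).length;
    }
}
```

In the real job, the Mapper could choose the `Fetcher` implementation in its `setup()` method (e.g. from a class name in the job `Configuration`), while unit tests construct `DownloadTask` with the fake directly. For question 2, this would indeed be a map-only job; Hadoop supports that by calling `job.setNumReduceTasks(0)`, in which case map output goes straight to the output format.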
This sounds similar to what Nutch does (although I'm not too familiar with Nutch beyond that statement).
Some points for observation:
I think you should take a look at Storm. It's a scalable framework that's very useful for collecting data from many different sources, which is really what you're trying to do here. Processing can still be done with MapReduce, but for the actual collection you should use a framework like Storm.
I think your internet connection will easily become the bottleneck in this case, but I'm sure it can be done.