
Import data from HDFS to MongoDB using mongoimport

I have a set of files on HDFS. Can I directly load these files into MongoDB (using mongoimport) without copying the files from HDFS to my hard disk?

Have you tried MongoInsertStorage? 你试过MongoInsertStorage吗?

You can simply load the dataset using Pig and then use MongoInsertStorage to dump directly into Mongo. It internally launches a bunch of mappers that do exactly what is mentioned in David Gruzman's answer on this page. One of the advantages of this approach is the parallelism and speed you achieve due to multiple mappers simultaneously inserting into the Mongo collection.

Here's a rough cut of what can be done with Pig:

REGISTER mongo-java-driver.jar  
REGISTER mongo-hadoop-core.jar
REGISTER mongo-hadoop-pig.jar

DEFINE MongoInsertStorage com.mongodb.hadoop.pig.MongoInsertStorage();

-- you need this here since multiple mappers could spawn with the same
-- data set and write duplicate records into the collection
SET mapreduce.map.speculative false
SET mapreduce.reduce.speculative false

-- or some equivalent loader
BIG_DATA = LOAD '/the/path/to/your/data' using PigStorage('\t'); 
STORE BIG_DATA INTO 'mongodb://hostname:27017/db.collection' USING MongoInsertStorage('', '');

More information here: https://github.com/mongodb/mongo-hadoop/tree/master/pig#inserting-directly-into-a-mongodb-collection

If we speak about big data I would look into scalable solutions.
We had a similar case of a serious data set (several terabytes) sitting in HDFS. This data, although with some transformation, was to be loaded into Mongo.
What we did was to develop a MapReduce job which runs over the data, and each mapper inserts its split of the data into MongoDB via the API.
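A minimal sketch of such a mapper is shown below, using the MongoDB Java driver. The class name, connection string, database and collection names, batch size, and the tab-separated field layout are all illustrative assumptions, not details from the original answer; a real job would read them from the Hadoop Configuration.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.bson.Document;

public class HdfsToMongoMapper
        extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

    private MongoClient client;
    private MongoCollection<Document> collection;
    private final List<Document> batch = new ArrayList<>();

    @Override
    protected void setup(Context context) {
        // Hypothetical connection details; pull them from the job Configuration in practice.
        client = MongoClients.create("mongodb://hostname:27017");
        collection = client.getDatabase("db").getCollection("collection");
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException {
        // Assumes tab-separated lines; apply whatever transformation you need here.
        String[] fields = value.toString().split("\t");
        batch.add(new Document("field0", fields[0]).append("raw", value.toString()));
        if (batch.size() >= 1000) {          // insert in batches to cut round trips
            collection.insertMany(batch);
            batch.clear();
        }
    }

    @Override
    protected void cleanup(Context context) {
        if (!batch.isEmpty()) {
            collection.insertMany(batch);    // flush the final partial batch
        }
        client.close();
    }
}

As with the Pig approach above, speculative execution of the map tasks should be disabled so that a re-run of a slow task does not insert the same split twice.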

Are you storing CSV/JSON files in HDFS? If so, you just need some way of mapping them to your filesystem so you can point mongoimport to the file.

Alternatively, mongoimport will take input from stdin unless a file is specified.

You can use mongoimport without the --file argument and load from stdin:

hadoop fs -text /path/to/file/in/hdfs/*.csv | mongoimport ...
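For example, a full CSV import piped this way might look like the following; the path, host, database, and collection names here are placeholders, not values from the original answer:

hadoop fs -text /path/to/file/in/hdfs/*.csv | \
  mongoimport --host hostname --db db --collection collection --type csv --headerline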
