Import CSV Using Mongoose Schema
Currently I need to push a large CSV file into a mongo DB, and the order of the values needs to determine the key for the DB entry:
Example CSV file:
9,1557,358,286,Mutantville,4368,2358026,,M,0,0,0,1,0
9,1557,359,147,Wroogny,4853,2356061,,D,0,0,0,1,0
Code to parse it into arrays:
var fs = require("fs");
var csv = require("fast-csv");

fs.createReadStream("rank.txt")
    .pipe(csv())
    .on("data", function(data){
        console.log(data);
    })
    .on("end", function(data){
        console.log("Read Finished");
    });
Code Output:
[ '9',
  '1557',
  '358',
  '286',
  'Mutantville',
  '4368',
  '2358026',
  '',
  'M',
  '0',
  '0',
  '0',
  '1',
  '0' ]
[ '9',
  '1557',
  '359',
  '147',
  'Wroogny',
  '4853',
  '2356061',
  '',
  'D',
  '0',
  '0',
  '0',
  '1',
  '0' ]
How do I insert the arrays into my mongoose schema to go into mongo db?
Schema:
var mongoose = require("mongoose");

var rankSchema = new mongoose.Schema({
    serverid: Number,
    resetid: Number,
    rank: Number,
    number: Number,
    name: String,
    land: Number,
    networth: Number,
    tag: String,
    gov: String,
    gdi: Number,
    protection: Number,
    vacation: Number,
    alive: Number,
    deleted: Number
});

module.exports = mongoose.model("Rank", rankSchema);
The order of the array needs to match the order of the schema; for instance, the first number 9 in the array always needs to be saved under the key "serverid", and so forth. I'm using Node.js.
You can do it with fast-csv by getting the headers from the schema definition, which will return the parsed lines as "objects". You actually have some mismatches, so I've marked them with corrections:
const fs = require('mz/fs');
const csv = require('fast-csv');
const { Schema } = mongoose = require('mongoose');

const uri = 'mongodb://localhost/test';

mongoose.Promise = global.Promise;
mongoose.set('debug', true);

const rankSchema = new Schema({
  serverid: Number,
  resetid: Number,
  rank: Number,
  name: String,
  land: String,         // <-- You have this as Number but it's a string
  networth: Number,
  tag: String,
  stuff: String,        // the empty field in the csv
  gov: String,
  gdi: Number,
  protection: Number,
  vacation: Number,
  alive: Number,
  deleted: Number
});

const Rank = mongoose.model('Rank', rankSchema);

const log = data => console.log(JSON.stringify(data, undefined, 2));

(async function() {

  try {
    const conn = await mongoose.connect(uri);

    // clean out the collections before importing
    await Promise.all(Object.entries(conn.models).map(([k,m]) => m.remove()));

    // derive the CSV headers from the schema, minus the internal paths
    let headers = Object.keys(Rank.schema.paths)
      .filter(k => ['_id','__v'].indexOf(k) === -1);

    console.log(headers);

    await new Promise((resolve,reject) => {

      let buffer = [],
          counter = 0;

      let stream = fs.createReadStream('input.csv')
        .pipe(csv({ headers }))
        .on("error", reject)
        .on("data", async doc => {
          stream.pause();
          buffer.push(doc);
          counter++;
          log(doc);
          try {
            if ( counter > 10000 ) {
              await Rank.insertMany(buffer);
              buffer = [];
              counter = 0;
            }
          } catch(e) {
            stream.destroy(e);
          }

          stream.resume();
        })
        .on("end", async () => {
          try {
            if ( counter > 0 ) {
              await Rank.insertMany(buffer);
              buffer = [];
              counter = 0;
            }
            resolve(); // resolve even when the final batch was already flushed
          } catch(e) {
            stream.destroy(e);
          }
        });

    });

  } catch(e) {
    console.error(e)
  } finally {
    process.exit()
  }

})()
As long as the schema actually lines up with the provided CSV then it's okay. These are the corrections that I can see, but if you need the actual field names aligned differently then you need to adjust. Basically there was a Number in the position where there is a String, and essentially an extra field, which I'm presuming is the blank one in the CSV.
The general things are getting the array of field names from the schema and passing that into the options when making the csv parser instance:
let headers = Object.keys(Rank.schema.paths)
  .filter(k => ['_id','__v'].indexOf(k) === -1);

let stream = fs.createReadStream('input.csv')
  .pipe(csv({ headers }));
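To see what that filter produces without a database connection, here is a minimal sketch; the paths object below is a hand-written stand-in for what Rank.schema.paths would contain (Mongoose adds the _id and __v paths automatically):

```javascript
// Stand-in for Rank.schema.paths: Mongoose keys this object by path name,
// including the automatic "_id" and "__v" paths we want to exclude.
const paths = {
  _id: {}, serverid: {}, resetid: {}, rank: {}, name: {},
  land: {}, networth: {}, tag: {}, stuff: {}, gov: {},
  gdi: {}, protection: {}, vacation: {}, alive: {}, deleted: {},
  __v: {}
};

// Same filter as in the answer: drop the internal paths, keep the CSV columns
const headers = Object.keys(paths)
  .filter(k => ['_id', '__v'].indexOf(k) === -1);

console.log(headers);
// [ 'serverid', 'resetid', 'rank', 'name', 'land', 'networth',
//   'tag', 'stuff', 'gov', 'gdi', 'protection', 'vacation',
//   'alive', 'deleted' ]
```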
Once you actually do that then you get an "Object" back instead of an array:
{
  "serverid": "9",
  "resetid": "1557",
  "rank": "358",
  "name": "286",
  "land": "Mutantville",
  "networth": "4368",
  "tag": "2358026",
  "stuff": "",
  "gov": "M",
  "gdi": "0",
  "protection": "0",
  "vacation": "0",
  "alive": "1",
  "deleted": "0"
}
Don't worry about the "types" because Mongoose will cast the values according to schema.
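To illustrate the kind of casting Mongoose performs (the castRow helper below is a hypothetical illustration, not Mongoose API), a sketch that applies the schema's type constructors to one parsed row:

```javascript
// Hypothetical illustration of schema-driven casting: every value arrives
// from the CSV as a string, and the constructor named in the schema
// converts it, the same way Mongoose casts on insert.
const schemaTypes = {
  serverid: Number, resetid: Number, rank: Number,
  name: String, land: String, networth: Number
};

function castRow(row) {
  const out = {};
  for (const [key, value] of Object.entries(row)) {
    const type = schemaTypes[key];
    out[key] = type ? type(value) : value; // Number('9') -> 9; strings pass through
  }
  return out;
}

const doc = castRow({
  serverid: '9', resetid: '1557', rank: '358',
  name: '286', land: 'Mutantville', networth: '4368'
});

console.log(doc);
// { serverid: 9, resetid: 1557, rank: 358,
//   name: '286', land: 'Mutantville', networth: 4368 }
```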
The rest happens within the handler for the data event. For maximum efficiency we are using insertMany() to only write to the database once every 10,000 lines. How that actually goes to the server and processes depends on the MongoDB version, but 10,000 should be pretty reasonable, based on the average number of fields you would import for a single collection, as a "trade-off" between memory usage and writing a reasonable network request. Make the number smaller if necessary.
The important parts are to mark these calls as async functions and await the result of insertMany() before continuing. Also we need to pause() the stream and resume() on each item, otherwise we run the risk of overwriting the buffer of documents to insert before they are actually sent. The pause() and resume() are necessary to put "back-pressure" on the pipe; otherwise items just keep "coming out" and firing the data event.
Naturally the control for the 10,000 entries requires we check that both on each iteration and on stream completion in order to empty the buffer and send any remaining documents to the server.
That's really what you want to do, as you certainly don't want to fire off an async request to the server on "every" iteration of the data event, or without waiting for each request to complete. You'll get away with not checking that for "very small files", but for any real-world load you're certain to exceed the call stack due to "in flight" async calls which have not yet completed.
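The batch control on its own can be sketched without any database, with a hypothetical flush function standing in for the Rank.insertMany() call and a small batch size for demonstration:

```javascript
// Sketch of the batch-control logic: buffer rows, flush every `batchSize`
// rows, then flush whatever remains when the input ends. `flush` records
// what would be handed to insertMany() in the real code.
const batches = [];
const flush = batch => batches.push(batch.slice());

const batchSize = 3; // 10,000 in the real code
let buffer = [];

for (const row of ['a', 'b', 'c', 'd', 'e', 'f', 'g']) {
  buffer.push(row);
  if (buffer.length >= batchSize) {   // the per-iteration check
    flush(buffer);
    buffer = [];
  }
}
if (buffer.length > 0) {              // the end-of-stream check for the remainder
  flush(buffer);
}

console.log(batches); // [ [ 'a', 'b', 'c' ], [ 'd', 'e', 'f' ], [ 'g' ] ]
```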
FYI - the package.json used. The mz is optional, as it's just a modernized, Promise-enabled library of standard node "built-in" libraries that I'm simply used to using. The code is of course completely interchangeable with the fs module.
{
  "description": "",
  "main": "index.js",
  "dependencies": {
    "fast-csv": "^2.4.1",
    "mongoose": "^5.1.1",
    "mz": "^2.7.0"
  },
  "keywords": [],
  "author": "",
  "license": "ISC"
}
Actually, with Node v8.9.x and above we can make this much simpler with an implementation of AsyncIterator through the stream-to-iterator module. It's still in Iterator<Promise<T>> mode, but it should do until Node v10.x becomes stable LTS:
const fs = require('mz/fs');
const csv = require('fast-csv');
const streamToIterator = require('stream-to-iterator');
const { Schema } = mongoose = require('mongoose');

const uri = 'mongodb://localhost/test';

mongoose.Promise = global.Promise;
mongoose.set('debug', true);

const rankSchema = new Schema({
  serverid: Number,
  resetid: Number,
  rank: Number,
  name: String,
  land: String,
  networth: Number,
  tag: String,
  stuff: String, // the empty field
  gov: String,
  gdi: Number,
  protection: Number,
  vacation: Number,
  alive: Number,
  deleted: Number
});

const Rank = mongoose.model('Rank', rankSchema);

const log = data => console.log(JSON.stringify(data, undefined, 2));

(async function() {

  try {
    const conn = await mongoose.connect(uri);

    await Promise.all(Object.entries(conn.models).map(([k,m]) => m.remove()));

    let headers = Object.keys(Rank.schema.paths)
      .filter(k => ['_id','__v'].indexOf(k) === -1);

    //console.log(headers);

    let stream = fs.createReadStream('input.csv')
      .pipe(csv({ headers }));

    const iterator = await streamToIterator(stream).init();

    let buffer = [],
        counter = 0;

    for ( let docPromise of iterator ) {
      let doc = await docPromise;
      buffer.push(doc);
      counter++;

      if ( counter > 10000 ) {
        await Rank.insertMany(buffer);
        buffer = [];
        counter = 0;
      }
    }

    if ( counter > 0 ) {
      await Rank.insertMany(buffer);
      buffer = [];
      counter = 0;
    }

  } catch(e) {
    console.error(e)
  } finally {
    process.exit()
  }

})()
Basically, all of the stream "event" handling, pausing, and resuming gets replaced by a simple for loop:
const iterator = await streamToIterator(stream).init();

for ( let docPromise of iterator ) {
  let doc = await docPromise;
  // ... the things in the loop
}
Easy! This gets cleaned up in later node implementations with for..await..of when it becomes more stable. But the above runs fine on the specified version and above.
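On Node versions where for await...of is available, the shape of that loop can be sketched without any extra modules, using a plain async generator standing in for the parsed CSV stream:

```javascript
// An async generator plays the role of the parsed CSV stream: each yielded
// value arrives asynchronously, and for await...of handles the waiting
// that Iterator<Promise<T>> mode made us do by hand.
async function* parsedRows() {
  yield { serverid: '9', name: '286' };
  yield { serverid: '9', name: '147' };
}

async function run() {
  const collected = [];
  for await (const doc of parsedRows()) {
    collected.push(doc); // in the real code: buffer and insertMany() in batches
  }
  return collected;
}

run().then(docs => console.log(docs.length)); // 2
```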
To add to @Neil Lunn's answer: this needs a header line within the CSV itself.
An example using the csvtojson module:
const csv = require('csvtojson');

// Assumes this runs inside an Express route handler, since it responds via `res`
const csvArray = [];
csv()
  .fromFile(filePath) // filePath: the path to your CSV file
  .on('json', (jsonObj) => {
    csvArray.push({ name: jsonObj.name, id: jsonObj.id });
  })
  .on('done', (error) => {
    if (error) {
      return res.status(500).json({ error });
    }
    Model.create(csvArray)
      .then((result) => {
        return res.status(200).json({ result });
      }).catch((err) => {
        return res.status(500).json({ error: err });
      });
  });
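The per-row field selection done in the 'json' handler can be sketched without the libraries; the rows array below is made-up sample data standing in for what csvtojson would emit:

```javascript
// Each parsed CSV row arrives as an object keyed by the header line;
// only the fields the model needs are kept before insertion.
const rows = [
  { name: 'Mutantville', id: '1', extra: 'ignored' },
  { name: 'Wroogny', id: '2', extra: 'ignored' }
];

const csvArray = rows.map(jsonObj => ({ name: jsonObj.name, id: jsonObj.id }));

console.log(csvArray);
// [ { name: 'Mutantville', id: '1' }, { name: 'Wroogny', id: '2' } ]
```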