
Import CSV Using Mongoose Schema

Currently I need to push a large CSV file into a MongoDB database, and the order of the values needs to determine the keys of each DB entry:

Example CSV file:

9,1557,358,286,Mutantville,4368,2358026,,M,0,0,0,1,0
9,1557,359,147,Wroogny,4853,2356061,,D,0,0,0,1,0

Code to parse it into arrays:

var fs = require("fs");

var csv = require("fast-csv");

fs.createReadStream("rank.txt")
    .pipe(csv())
    .on("data", function(data){
        console.log(data);
    })
    .on("end", function(data){
        console.log("Read Finished");
    });

Code Output:

[ '9',
  '1557',
  '358',
  '286',
  'Mutantville',
  '4368',
  '2358026',
  '',
  'M',
  '0',
  '0',
  '0',
  '1',
  '0' ]
[ '9',
  '1557',
  '359',
  '147',
  'Wroogny',
  '4853',
  '2356061',
  '',
  'D',
  '0',
  '0',
  '0',
  '1',
  '0' ]

How do I insert the arrays into my Mongoose schema so the data goes into MongoDB?

Schema:

var mongoose = require("mongoose");


var rankSchema = new mongoose.Schema({
   serverid: Number,
   resetid: Number,
   rank: Number,
   number: Number,
   name: String,
   land: Number,
   networth: Number,
   tag: String,
   gov: String,
   gdi: Number,
   protection: Number,
   vacation: Number,
   alive: Number,
   deleted: Number
});

module.exports = mongoose.model("Rank", rankSchema);

The order of the array needs to match the order of the schema; for instance, the first number in the array, 9, always needs to be saved under the key "serverid", and so forth. I'm using Node.js.

You can do it with fast-csv by getting the headers from the schema definition, which will make the parser return the parsed lines as "objects". You actually have some mismatches, so I've marked them with corrections:

const fs = require('mz/fs');
const csv = require('fast-csv');

const { Schema } = mongoose = require('mongoose');

const uri = 'mongodb://localhost/test';

mongoose.Promise = global.Promise;
mongoose.set('debug', true);

const rankSchema = new Schema({
  serverid: Number,
  resetid: Number,
  rank: Number,
  name: String,
  land: String,         // <-- You have this as Number but it's a string
  networth: Number,
  tag: String,
  stuff: String,        // the empty field in the csv
  gov: String,
  gdi: Number,
  protection: Number,
  vacation: Number,
  alive: Number,
  deleted: Number
});

const Rank = mongoose.model('Rank', rankSchema);

const log = data => console.log(JSON.stringify(data, undefined, 2));

(async function() {

  try {
    const conn = await mongoose.connect(uri);

    await Promise.all(Object.entries(conn.models).map(([k,m]) => m.remove()));

    let headers = Object.keys(Rank.schema.paths)
      .filter(k => ['_id','__v'].indexOf(k) === -1);

    console.log(headers);

    await new Promise((resolve,reject) => {

      let buffer = [],
          counter = 0;

      let stream = fs.createReadStream('input.csv')
        .pipe(csv({ headers }))
        .on("error", reject)
        .on("data", async doc => {
          stream.pause();
          buffer.push(doc);
          counter++;
          log(doc);
          try {
            if ( counter > 10000 ) {
              await Rank.insertMany(buffer);
              buffer = [];
              counter = 0;
            }
          } catch(e) {
            stream.destroy(e);
          }

          stream.resume();

        })
        .on("end", async () => {
          try {
            if ( counter > 0 ) {
              await Rank.insertMany(buffer);
              buffer = [];
              counter = 0;
              resolve();
            }
          } catch(e) {
            stream.destroy(e);
          }
        });

    });


  } catch(e) {
    console.error(e)
  } finally {
    process.exit()
  }


})()

As long as the schema actually lines up with the provided CSV then it's okay. These are the corrections that I can see, but if you need the actual field names aligned differently then you need to adjust. But there was basically a Number in a position where there is a String, and essentially an extra field, which I'm presuming is the blank one in the CSV.

The general idea is getting the array of field names from the schema and passing that into the options when making the csv parser instance:

let headers = Object.keys(Rank.schema.paths)
  .filter(k => ['_id','__v'].indexOf(k) === -1);

let stream = fs.createReadStream('input.csv')
  .pipe(csv({ headers }))

Once you actually do that, you get an "Object" back instead of an array:

{
  "serverid": "9",
  "resetid": "1557",
  "rank": "358",
  "name": "286",
  "land": "Mutantville",
  "networth": "4368",
  "tag": "2358026",
  "stuff": "",
  "gov": "M",
  "gdi": "0",
  "protection": "0",
  "vacation": "0",
  "alive": "1",
  "deleted": "0"
}

Don't worry about the "types" because Mongoose will cast the values according to the schema.
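
As a quick illustration (with hypothetical values), constructing a document from the string values shows that casting:

const doc = new Rank({ serverid: "9", resetid: "1557", rank: "358" });
console.log(typeof doc.serverid);   // "number" -- cast from the string "9" by the schema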

The rest happens within the handler for the data event. For maximum efficiency we are using insertMany() to only write to the database once every 10,000 lines. How that actually goes to the server and gets processed depends on the MongoDB version, but 10,000 should be pretty reasonable based on the average number of fields you would import for a single collection, as a "trade-off" between memory usage and writing a reasonable network request. Make the number smaller if necessary.

The important parts are to mark these calls as async functions and await the result of insertMany() before continuing. Also we need to pause() the stream and resume() on each item, otherwise we run the risk of overwriting the buffer of documents to insert before they are actually sent. The pause() and resume() are necessary to put "back-pressure" on the pipe, otherwise items just keep "coming out" and firing the data event.
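
Stripped of the batching details, the back-pressure pattern in the data handler is essentially this sketch, where writeOut() is just a hypothetical stand-in for the insertMany() step shown above:

stream.on("data", async doc => {
  stream.pause();          // stop further "data" events while the async work runs
  try {
    await writeOut(doc);   // hypothetical async write, e.g. the batched insertMany()
  } catch(e) {
    stream.destroy(e);
  }
  stream.resume();         // let the next rows flow again
});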

Naturally the control for the 10,000 entries requires we check that both on each iteration and on stream completion in order to empty the buffer and send any remaining documents to the server.

That's really what you want to do, as you certainly don't want to fire off an async request to the server on "every" iteration of the data event, essentially without waiting for each request to complete. You'll get away with not checking that for "very small files", but for any real-world load you're certain to exceed the call stack due to "in flight" async calls which have not yet completed.


FYI, the package.json used is below. The mz is optional, as it's just a modernized, Promise-enabled library of the standard node "built-in" libraries that I'm simply used to using. The code is of course completely interchangeable with the core fs module.

{
  "description": "",
  "main": "index.js",
  "dependencies": {
    "fast-csv": "^2.4.1",
    "mongoose": "^5.1.1",
    "mz": "^2.7.0"
  },
  "keywords": [],
  "author": "",
  "license": "ISC"
}
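
Since only createReadStream() is used here, swapping back to the core module would just be:

// mz/fs only wraps the callback APIs in Promises; createReadStream() is unchanged,
// so the core module is a drop-in replacement for this script.
const fs = require('fs');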

Actually, with Node v8.9.x and above we can even make this much simpler with an implementation of AsyncIterator through the stream-to-iterator module. It's still in Iterator<Promise<T>> mode, but it should do until Node v10.x becomes stable LTS:

const fs = require('mz/fs');
const csv = require('fast-csv');
const streamToIterator = require('stream-to-iterator');

const { Schema } = mongoose = require('mongoose');

const uri = 'mongodb://localhost/test';

mongoose.Promise = global.Promise;
mongoose.set('debug', true);

const rankSchema = new Schema({
  serverid: Number,
  resetid: Number,
  rank: Number,
  name: String,
  land: String,
  networth: Number,
  tag: String,
  stuff: String,        // the empty field
  gov: String,
  gdi: Number,
  protection: Number,
  vacation: Number,
  alive: Number,
  deleted: Number
});

const Rank = mongoose.model('Rank', rankSchema);

const log = data => console.log(JSON.stringify(data, undefined, 2));

(async function() {

  try {
    const conn = await mongoose.connect(uri);

    await Promise.all(Object.entries(conn.models).map(([k,m]) => m.remove()));

    let headers = Object.keys(Rank.schema.paths)
      .filter(k => ['_id','__v'].indexOf(k) === -1);

    //console.log(headers);

    let stream = fs.createReadStream('input.csv')
      .pipe(csv({ headers }));

    const iterator = await streamToIterator(stream).init();

    let buffer = [],
        counter = 0;

    for ( let docPromise of iterator ) {
      let doc = await docPromise;
      buffer.push(doc);
      counter++;

      if ( counter > 10000 ) {
        await Rank.insertMany(buffer);
        buffer = [];
        counter = 0;
      }
    }

    if ( counter > 0 ) {
      await Rank.insertMany(buffer);
      buffer = [];
      counter = 0;
    }

  } catch(e) {
    console.error(e)
  } finally {
    process.exit()
  }

})()

Basically, all of the stream "event" handling and pausing and resuming gets replaced by a simple for loop:

const iterator = await streamToIterator(stream).init();

for ( let docPromise of iterator ) {
  let doc = await docPromise;
  // ... The things in the loop
}

Easy! This gets cleaned up in a later node implementation with for..await..of when it becomes more stable. But the above runs fine on the specified version and above.
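
For what it's worth, on Node v10.x and above readable streams should be async iterable natively, so a sketch of the same loop (inside the same async function, using the same stream and Rank model as above) could look like this:

let buffer = [],
    counter = 0;

for await (const doc of stream) {       // the stream yields the parsed "object" rows
  buffer.push(doc);
  counter++;

  if ( counter > 10000 ) {
    await Rank.insertMany(buffer);
    buffer = [];
    counter = 0;
  }
}

if ( counter > 0 )
  await Rank.insertMany(buffer);        // flush whatever is left in the buffer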

In contrast to @Neil Lunn's approach, this needs a header line within the CSV itself.

Example using the csvtojson module (this assumes an Express handler context with res and a Mongoose Model):

const csv = require('csvtojson');

const csvArray = [];

csv()
  .fromFile(filePath)   // placeholder: path to the CSV file, which must contain a header line
  .on('json', (jsonObj) => {
    csvArray.push({ name: jsonObj.name, id: jsonObj.id });
  })
  .on('done', (error) => {
    if (error) {
      return res.status(500).json({ error });
    }
    Model.create(csvArray)
      .then((result) => {
        return res.status(200).json({ result });
      })
      .catch((err) => {
        return res.status(500).json({ error: err });
      });
  });
