Node.js: difference in file size when copying csv

I have the code below, which reads from a CSV and writes to another CSV. I will be transforming some data before writing to the other file, but as a test I ran the code and see slight differences between the source and destination files, even without changing anything about the files.

  for(const m of metadata) {
      tempm = m;
      fname = path;
      const pipelineAsync = promisify(pipeline);
      if(m.path) {
        await pipelineAsync(
          fs.createReadStream(m.path),
          csv.parse({delimiter: '\t', columns: true}),
          csv.transform((input) => {
            return Object.assign({}, input);
          }),
          csv.stringify({header: true, delimiter: '\t'}),
          fs.createWriteStream(fname, {encoding: 'utf16le'})
        )
        let nstats = fs.statSync(fname);
        tempm['transformedPath'] = fname;
        tempm['transformed'] = true;
        tempm['t_size_bytes'] = nstats.size;
      }
  }

For example, I see:

file a: the source file size is `895631` while after copying destination file size is `898545`
file b: the source file size is `51388` while after copying destination file size is `52161`
file c: the source file size is `13666` while after copying destination file size is `13587`

But when I do not use the transform, the sizes match. For example, this code produces exactly the same file sizes for both source and destination:


  for(const m of metadata) {
      tempm = m;
      fname = path;
      const pipelineAsync = promisify(pipeline);
      if(m.path) {
        await pipelineAsync(
          fs.createReadStream(m.path),
          /*csv.parse({delimiter: '\t', columns: true}),
          csv.transform((input) => {
            return Object.assign({}, input);
          }),
          csv.stringify({header: true, delimiter: '\t'}),*/
          fs.createWriteStream(fname, {encoding: 'utf16le'})
        )
        let nstats = fs.statSync(fname);
        tempm['transformedPath'] = fname;
        tempm['transformed'] = true;
        tempm['t_size_bytes'] = nstats.size;
      }
  }

Can anyone please help identify which options I need to pass to the csv transformation so that the copy happens correctly?

I am doing this test to ensure I am not losing any data in large files.

Thanks.

Update 1: I have also checked that the encoding of both files is the same.

Update 2: I notice that the source file has CRLF and the destination file has LF. Is there a way to keep them the same using Node.js, or is it OS dependent?

Update 3: Looks like the issue is EOL: the source file has CRLF while the destination (transformed) file has LF. I now need to find a way to specify this in my code above so that the EOL is consistent.

You need to set up your EOL config:

const { pipeline } = require('stream')
const { promisify } = require('util')
const fs = require('fs')
const csv = require('csv')
const os = require('os')


;(async function () {
  const pipelineAsync = promisify(pipeline)
  await pipelineAsync(
    fs.createReadStream('out'),
    csv.parse({ delimiter: ',', columns: true }),
    csv.transform((input) => {
      return Object.assign({}, input)
    }),
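    // Here is the trick: set the record delimiter explicitly.
    // os.EOL is the platform default ('\r\n' on Windows, '\n' on POSIX).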
    csv.stringify({ eol: true, record_delimiter: os.EOL, header: true, delimiter: '\t' }),
    fs.createWriteStream('out2', { encoding: 'utf16le' })
  )
})()

You can also use `\r\n`, or whatever delimiter you need, for example `$-new-line\n`.

This setting can be found by reading the source code.
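Since `os.EOL` reflects the machine running the script rather than the source file, one option is to look at the source data itself and pass whatever it uses. The following is only a minimal sketch under that assumption: `detectEol` is a hypothetical helper, not part of the `csv` package, and it simply picks whichever of CRLF or bare LF occurs more often in a text sample.

```javascript
// Hypothetical helper (not part of the csv package): guess the dominant
// line ending of a text sample by counting CRLF and bare LF occurrences.
function detectEol(text) {
  const crlf = (text.match(/\r\n/g) || []).length
  // Every CRLF also contains an LF, so subtract to get the bare LF count.
  const lf = (text.match(/\n/g) || []).length - crlf
  return crlf > lf ? '\r\n' : '\n'
}

// Example: a tab-separated sample with Windows line endings.
const sample = 'id\tname\r\n1\talice\r\n2\tbob\r\n'
const eol = detectEol(sample)

// The detected value could then be handed to the stringifier, e.g.
//   csv.stringify({ header: true, delimiter: '\t', record_delimiter: eol })
console.log(JSON.stringify(eol))
```

For a real file, the sample would be the first chunk read from the source, decoded with the file's actual encoding before counting.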

The two main sources of this kind of difference are:

  1. EOL style (Unix or MS-DOS)
  2. file encoding

Using the simple Unix `file` command-line utility you can check both the encoding and the EOL style of the source files. Make sure to use the same options for the destination files and any difference should disappear.

Hope this helps.
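If the `file` utility is not available (on Windows, for instance), a rough equivalent can be sketched in Node.js itself. The helper names below (`sniffEncoding`, `describeBuffer`) are hypothetical, and the encoding guess relies only on a byte-order mark, so a file without a BOM falls back to an assumed default of UTF-8. Treat it as an illustration of the two checks, not a replacement for `file`.

```javascript
// Hypothetical helper: guess the encoding from a byte-order mark only.
// Returns null when no recognised BOM is present.
function sniffEncoding(buf) {
  if (buf.length >= 3 && buf[0] === 0xef && buf[1] === 0xbb && buf[2] === 0xbf) {
    return 'utf8'
  }
  if (buf.length >= 2 && buf[0] === 0xff && buf[1] === 0xfe) {
    return 'utf16le'
  }
  return null
}

// Hypothetical helper: report size, guessed encoding and line-ending counts.
// `fallback` is an assumption used when the buffer carries no BOM.
function describeBuffer(buf, fallback = 'utf8') {
  const encoding = sniffEncoding(buf) || fallback
  const text = buf.toString(encoding)
  const crlf = (text.match(/\r\n/g) || []).length
  const lf = (text.match(/\n/g) || []).length - crlf // bare LF only
  return { bytes: buf.length, encoding, crlf, lf }
}

// Same two records, once as UTF-16LE with BOM and CRLF, once as UTF-8 with LF.
const source = Buffer.from('\ufeffa\tb\r\nc\td\r\n', 'utf16le')
const dest = Buffer.from('a\tb\nc\td\n', 'utf8')

console.log(describeBuffer(source))
console.log(describeBuffer(dest))
```

With real files, the buffers would come from `fs.readFileSync(path)`. Comparing the two reports shows whether a size mismatch comes from the encoding, the line endings, or both.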
