
Sorting and diffing large files with Node.js

I have two files with a UUID on each line. Each file has several hundred thousand lines (they're generated from database dumps). These files need to be sorted and differences found (additions/removals). This is easy to do using a few *nix tools and it only takes a few seconds:

$ sort file-a.txt > file-a-sorted.txt
$ sort file-b.txt > file-b-sorted.txt
$ diff file-a-sorted.txt file-b-sorted.txt

However I'd like to add this functionality to a CLI we have (built on Node) that's intended for multiplatform use. So spawning a subprocess and delegating to these tools is not an option.

Being 'dumb' and loading each file into memory, splitting on newlines and calling .sort() on the resulting array works surprisingly well (it's quick albeit using quite a lot of memory...) but finding the differences is proving rather harder.

I'm sure the answer lies somewhere in the realm of streams but I lack experience manipulating them so I'm unsure where to begin.

What are efficient techniques to load, sort and diff such large files using Node.js?

I'm not looking for full solutions (though, feel free!), just pointers would be really useful at this stage.

Thanks!

Since you already have the files in memory as sorted arrays, check out difflib.

This seems to fit exactly your use case:

>>> difflib.unifiedDiff('one two three four'.split(' '),
...                     'zero one tree four'.split(' '), {
...                       fromfile: 'Original',
...                       tofile: 'Current',
...                       fromfiledate: '2005-01-26 23:30:50',
...                       tofiledate: '2010-04-02 10:20:52',
...                       lineterm: ''
...                     })
[ '--- Original\t2005-01-26 23:30:50',
  '+++ Current\t2010-04-02 10:20:52',
  '@@ -1,4 +1,4 @@',
  '+zero',
  ' one',
  '-two',
  '-three',
  '+tree',
  ' four' ]
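
Since both arrays are already sorted, a single linear merge pass would also find the additions and removals without pulling in a library. Below is a minimal sketch, assuming both arrays were sorted with the default .sort() so that `<` orders the lines the same way. Note that sortedDiff is a hypothetical name for illustration, not part of difflib:

```javascript
// Hypothetical helper: walk two sorted arrays in lockstep and collect
// the lines that appear in only one of them.
function sortedDiff (a, b) {
  const removed = [] // in a, missing from b
  const added = [] // in b, missing from a
  let i = 0
  let j = 0

  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      // same line in both files, skip it
      i++
      j++
    } else if (a[i] < b[j]) {
      // a is "behind", so this line cannot appear later in b
      removed.push(a[i++])
    } else {
      added.push(b[j++])
    }
  }

  // whatever is left over exists in one file only
  while (i < a.length) removed.push(a[i++])
  while (j < b.length) added.push(b[j++])

  return { removed, added }
}
```

Each array is visited once, so the comparison itself is linear in the total number of lines; the sorting step remains the expensive part.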

In the end we went for something very simple using sets which, unlike arrays, remain extremely performant and memory efficient even with many thousands of entries. This is our initial test code:

const fs = require('fs')
const readline = require('readline')

// resident set size in megabytes, to two decimal places
const memory = () => (process.memoryUsage().rss / 1048576).toFixed(2)

const loadFile = (filename, cb) => {
  // this is more complex than simply calling fs.readFile() but
  // means we do not have to buffer the whole file in memory
  return new Promise((resolve, reject) => {
    const input = fs.createReadStream(filename)
    const reader = readline.createInterface({ input })

    input.on('error', reject)

    reader.on('line', cb)
    reader.on('close', resolve)
  })
}

const start = Date.now()

const uniqueA = new Set()
const uniqueB = new Set()

// when reading the first file add every line to the set
const handleA = (line) => {
  uniqueA.add(line)
}

// when reading the second file remove lines found in both files, which
// leaves each set holding only the lines unique to its own file
const handleB = (line) => {
  if (uniqueA.has(line)) {
    uniqueA.delete(line)
  } else {
    uniqueB.add(line)
  }
}

console.log(`Starting memory: ${memory()}mb`)

Promise.resolve()
  .then(() => loadFile('uuids-eu.txt', handleA))
  .then(() => {
    console.log(`${uniqueA.size} items loaded into set`)
    console.log(`Memory: ${memory()}mb`)
  })
  .then(() => loadFile('uuids-us.txt', handleB))
  .then(() => {
    const end = Date.now()

    console.log(`Time taken: ${(end - start) / 1000}s`)
    console.log(`Final memory: ${memory()}mb`)

    console.log('Differences A:', Array.from(uniqueA))
    console.log('Differences B:', Array.from(uniqueB))
  })

Which gives us this output (on a 2011 MacBook Air):

Starting memory: 19.71mb
678336 items loaded into set
Memory: 135.95mb
Time taken: 1.918s
Final memory: 167.06mb
Differences A: [ ... ]
Differences B: [ ... ]

Using the 'dumb' method of loading the file and splitting on newlines is even faster (~1.2s) but with significantly higher memory overhead (~2x).
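
For comparison, the 'dumb' method could look roughly like the sketch below. It is not the code we timed; loadLinesSync and difference are hypothetical names, and it assumes the files are UTF-8 with one entry per line:

```javascript
const fs = require('fs')

// Hypothetical helper: read the whole file at once and return its
// non-empty lines as a Set. Simple, but the file contents and the
// split array are both held in memory while the Set is built.
const loadLinesSync = (filename) => {
  const text = fs.readFileSync(filename, 'utf8')
  return new Set(text.split(/\r?\n/).filter((line) => line !== ''))
}

// lines present in `a` but missing from `b`
const difference = (a, b) => [...a].filter((line) => !b.has(line))
```

Calling difference(a, b) and then difference(b, a) gives the removals and the additions respectively.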

Our solution using Set also has the advantage that we can skip the sorting step entirely, which makes it faster than the *nix tools outlined in the original question.
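
One caveat: handleB deletes matched lines from uniqueA, so if the same line occurs twice in the second file, its second occurrence no longer finds a match and ends up in uniqueB. That is fine when every line is known to be unique within its file. If that cannot be guaranteed, a variant along these lines might help. It is only a sketch, and createDiffer, addA, addB and result are hypothetical names:

```javascript
// Hypothetical helper: instead of deleting matched lines, remember which
// lines of file A were also seen in file B.
const createDiffer = () => {
  // line -> true once the same line has been seen in file B
  const seenInA = new Map()
  const onlyInB = new Set()

  return {
    // call for every line of the first file
    addA: (line) => {
      seenInA.set(line, false)
    },
    // call for every line of the second file, once the first has finished
    addB: (line) => {
      if (seenInA.has(line)) {
        seenInA.set(line, true)
      } else {
        onlyInB.add(line)
      }
    },
    // lines unique to each file, in the order they were first read
    result: () => ({
      onlyInA: [...seenInA]
        .filter(([, matched]) => !matched)
        .map(([line]) => line),
      onlyInB: [...onlyInB]
    })
  }
}
```

The addA and addB functions can be passed to loadFile as the line callbacks in place of handleA and handleB. The trade-off is that every line of the first file stays in memory until the end, so peak memory will likely be somewhat higher than with the delete-as-you-go version.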


 