Streaming over 1m records from SQL Server to MongoDB using Node.js

I'm trying to copy 8,000,000 rows of data from Microsoft SQL Server into MongoDB. It works great for 100,000 records, but when I try to pull 1,000,000 records (or all of them), I get the following error:

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory

Here's the code (CoffeeScript) I'm currently using:

MsSqlClient   = require 'mssql'
MongoClient = require('mongodb').MongoClient

config = {}
config.mongodb = 'mongodb://localhost:27017/dbname'
config.mssql = 'mssql://user::pass@host/dbname'

Promise.all(
  [
    MongoClient.connect config.mongodb
    MsSqlClient.connect config.mssql
  ]
).then (a) ->
  mongo = a[0]
  sql = a[1]

  collection = mongo.collection "collection_name"

  request = new MsSqlClient.Request()
  request.stream = true # stream rows instead of buffering the whole result set

  request.on 'row', (row) ->
    collection.insert(row) # one insert per row; the result is ignored, so nothing slows the incoming rows

  request.on 'done', (affected) ->
    console.log "Completed"

  sql.on 'error', (err) ->
    console.log err

  console.log "Querying"
  request.query("SELECT * FROM big_table")

.catch (err) ->
  console.log "ERROR: ", err

It seems that the writes to MongoDB take much longer than the download from SQL Server, which I believe is causing the bottleneck. Is there a way to slow down (pause/resume) the stream from SQL Server so I can pull and write in chunks, without adding an index column to the SQL data and selecting by row number?

Running:

  • Windows 7, SQL Server 2012 (SP1), MongoDB 2.8.0
  • Node.js 4.2.4 / mssql 3.3.0 / mongodb 2.1.19

You could do it in blocks (50,000 rows, for example). Here is one way to do it (SQL side only); it's not super fast, but it should work:

First, get the number of blocks; you have to loop over this number outside of SQL:

    -- get blocks

    select count(*) / 50000 as NumberOfBlocksToLoop
    from YOUR.TABLE
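
From Node, that count can be fetched once up front to work out how many blocks to loop over. A minimal CoffeeScript sketch, assuming the global mssql connection from the question, the placeholder table name YOUR.TABLE, and that the query promise resolves with the recordset array (as it does in mssql 3.x; newer versions put it under result.recordset):

    # Sketch: work out how many 50,000-row blocks the loop needs (placeholder table name)
    blockSize = 50000

    countRequest = new MsSqlClient.Request()
    countRequest.query("select count(*) as Total from YOUR.TABLE").then (recordset) ->
      # Math.ceil instead of integer division so the final partial block is included
      numberOfBlocks = Math.ceil(recordset[0].Total / blockSize)
      console.log "#{numberOfBlocks} blocks to transfer"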

Then get each block, where ColumnU is a column that lets you sort your table (alternatively, you could use an ID directly, but then you might have the problem of gaps if data is deleted from the table):

    -- get first n-block

    declare @BlockNumber int

    set @BlockNumber = 1

    select ColumnX
    from
    (
        select row_number() over (order by ColumnU asc) as RowNumber,
        TABLE.ColumnX
        from YOUR.TABLE
    ) Data
    where RowNumber between ((@BlockNumber - 1) * 50000) + 1 and @BlockNumber * 50000
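
The outer loop itself could then look roughly like the following CoffeeScript sketch. It is only a sketch under a few assumptions: it reuses the paged query above with its placeholder names (YOUR.TABLE, ColumnU, ColumnX), takes blockSize and numberOfBlocks from the count sketch earlier, expects the query promise to resolve with the recordset array (mssql 3.x behaviour), and writes each block with insertMany so only one block is held in memory at a time:

    # Sketch: copy YOUR.TABLE block by block; placeholder names as in the queries above
    copyBlock = (collection, blockNumber, lastBlock) ->
      return Promise.resolve() if blockNumber > lastBlock
      request = new MsSqlClient.Request()
      request.input 'BlockNumber', MsSqlClient.Int, blockNumber
      pagedQuery = """
        select ColumnX -- select whichever columns you actually want to copy
        from
        (
            select row_number() over (order by ColumnU asc) as RowNumber, ColumnX
            from YOUR.TABLE
        ) Data
        where RowNumber between ((@BlockNumber - 1) * #{blockSize}) + 1 and @BlockNumber * #{blockSize}
        """
      request.query(pagedQuery).then (rows) ->
        # one bounded insert per block instead of one unchecked insert per row
        collection.insertMany(rows).then ->
          console.log "block #{blockNumber}/#{lastBlock} copied"
          copyBlock collection, blockNumber + 1, lastBlock

    # kick it off; 'collection' is the Mongo collection from the question's code,
    # numberOfBlocks comes from the count sketch above
    copyBlock(collection, 1, numberOfBlocks).then ->
      console.log "Completed"
    .catch (err) ->
      console.log "ERROR: ", err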

Try to find a good block size for your system to avoid running into the out-of-memory exception again. You should also catch that exception, and then, depending on how much effort you want to invest, either delete the already transferred data and restart, or calculate a smaller block size (a bit more difficult) and continue transferring the rest.
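
As for the pause/resume part of the question: newer releases of the mssql driver (4.x and later) document request.pause() and request.resume() for streaming requests, so the MongoDB writes can apply backpressure directly instead of paging by row number. A minimal sketch, assuming such a driver version (not the 3.3.0 from the question) and an arbitrary batch size of 1,000 rows:

    # Sketch: stream with backpressure; requires an mssql version exposing pause()/resume()
    request = new MsSqlClient.Request()
    request.stream = true

    batch = []

    flush = ->
      return Promise.resolve() if batch.length is 0
      docs = batch
      batch = []
      collection.insertMany docs

    request.on 'row', (row) ->
      batch.push row
      if batch.length >= 1000
        request.pause() # stop pulling rows while Mongo catches up
        flush().then -> request.resume()

    request.on 'done', ->
      flush().then -> console.log "Completed"

    request.on 'error', (err) ->
      console.log err

    request.query "SELECT * FROM big_table"

Batching into insertMany and pausing the request keeps at most one batch in flight, instead of queuing an unacknowledged insert for every incoming row as in the original code.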
