
Reading a file line-by-line in JavaScript on the client side

Could you please help me with the following issue?

Goal

Read a file on the client side (in the browser, via JS and HTML5 classes) line by line, without loading the whole file into memory.

Scenario

I'm working on a web page which should parse files on the client side. Currently, I'm reading the file as described in this article.

HTML:

<input type="file" id="files" name="files[]" />

JavaScript:

$("#files").on('change', function(evt){
    // creating FileReader
    var reader = new FileReader();

    // assigning handler
    reader.onloadend = function(evt) {      
        var lines = evt.target.result.split(/\r?\n/);

        lines.forEach(function (line) {
            parseLine(...);
        }); 
    };

    // getting File instance
    var file = evt.target.files[0];

    // start reading
    reader.readAsText(file);
});

The problem is that FileReader reads the whole file at once, which crashes the tab for big files (size >= 300 MB). Using reader.onprogress doesn't solve the problem, as it just accumulates the result until it hits the limit.

Reinventing the wheel

I've done some research on the internet and found no simple way to do this (there are a bunch of articles describing this exact functionality, but on the server side for node.js).

The only way I see to solve it is the following:

  1. Split the file into chunks (via the File.slice(startByte, endByte) method)
  2. Find the last newline character ('\n') in that chunk
  3. Read that chunk except the part after the last newline character, convert it to a string, and split it into lines
  4. Read the next chunk starting from the last newline character found in step 2
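
Steps 1–2 boil down to byte-level bookkeeping that can be sketched independently of FileReader. The helper names below are hypothetical illustrations, not part of any library:

```javascript
// Hypothetical helpers for steps 1-2: scan a chunk's bytes backwards for the
// last '\n' so the slice can be cut on a line boundary.
function lastNewlineEnd(bytes) {
    var NL = "\n".charCodeAt(0); // 0x0A
    for (var i = bytes.length - 1; i >= 0; i--) {
        if (bytes[i] === NL) return i + 1; // cut just after the newline
    }
    return -1; // no newline in this chunk: caller must widen the slice
}

// Where the next chunk should start, given this chunk's start offset.
function nextChunkStart(chunkStart, bytes) {
    var cut = lastNewlineEnd(bytes);
    return cut === -1 ? -1 : chunkStart + cut;
}
```

In the browser the bytes would come from `file.slice(start, end)` read with `readAsArrayBuffer`, and the part before the cut point would then be re-read with `readAsText` and split into lines.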

But I'd rather use something that already exists, to avoid entropy growth.

Eventually I created a new line-by-line reader, which is totally different from the previous one.

Features:

  • Index-based access to the File (sequential and random)
  • Optimized for repeated random reading (milestones with byte offsets are saved for lines already navigated in the past), so after you've read the whole file once, accessing line 43422145 will be almost as fast as accessing line 12.
  • Searching in the file: find next and find all.
  • Exact index, offset and length of matches, so you can easily highlight them

Check this jsFiddle for examples.

Usage:

// Initialization
var file; // HTML5 File object
var navigator = new FileNavigator(file);

// Read some amount of lines (best performance for sequential file reading)
navigator.readSomeLines(startingFromIndex, function (err, index, lines, eof, progress) { ... });

// Read exact amount of lines
navigator.readLines(startingFromIndex, count, function (err, index, lines, eof, progress) { ... });

// Find first from index
navigator.find(pattern, startingFromIndex, function (err, index, match) { ... });

// Find all matching lines
navigator.findAll(new RegExp(pattern), indexToStartWith, limitOfMatches, function (err, index, limitHit, results) { ... });
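
To read a whole file sequentially with this API, you keep calling readSomeLines from the index just past the last batch until eof is reported. The driver below is a sketch against the callback signature shown above, not part of the library itself:

```javascript
// Sketch: drain a FileNavigator-style reader batch by batch, collecting
// all lines, and call done(err, allLines) when eof is reached.
function readAllLines(navigator, done) {
    var all = [];
    (function next(indexToStart) {
        navigator.readSomeLines(indexToStart, function (err, index, lines, eof, progress) {
            if (err) return done(err);
            all = all.concat(lines);
            if (eof) return done(null, all);
            next(index + lines.length); // continue after this batch
        });
    })(0);
}
```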

Performance is the same as with the previous solution. You can measure it by invoking 'Read' in the jsFiddle.

GitHub: https://github.com/anpur/client-line-navigator/wiki

Update: check LineNavigator from my second answer instead; that reader is way better.

I've made my own reader, which fulfills my needs.

Performance

As the issue is related only to huge files, performance was the most important part.

[performance chart]

As you can see, performance is almost the same as with direct reading (as described in the question above). Currently I'm trying to make it better, as the biggest time consumer is the async call used to avoid hitting the call-stack limit, which is not strictly necessary for execution. Performance issue solved.
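
The call-stack trade-off mentioned here can be illustrated with a toy example (not the reader's actual code): direct recursion per batch eventually overflows the stack, while deferring each step with setTimeout(fn, 0) stays stack-safe at the cost of a timer tick per call:

```javascript
// Direct recursion: one stack frame per step, so large n throws RangeError.
function countSync(n) {
    if (n === 0) return 0;
    return 1 + countSync(n - 1);
}

// Deferred recursion: each step runs on a fresh stack (no overflow),
// but every step pays the latency of a timer callback.
function countAsync(n, acc, done) {
    if (n === 0) return done(acc);
    setTimeout(function () { countAsync(n - 1, acc + 1, done); }, 0);
}
```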

Quality

The following cases were tested:

  • Empty file
  • Single-line file
  • File with and without a trailing newline character
  • Parsed lines are checked
  • Multiple runs on the same page
  • No lines are lost and there are no ordering problems

Code & Usage

HTML:

<input type="file" id="file-test" name="files[]" />
<div id="output-test"></div>

Usage:

$("#file-test").on('change', function(evt) {
    var startProcessing = new Date();
    var index = 0;
    var file = evt.target.files[0];
    var reader = new FileLineStreamer();
    $("#output-test").html("");

    reader.open(file, function (lines, err) {
        if (err != null) {
            $("#output-test").append('<span style="color:red;">' + err + "</span><br />");
            return;
        }
        if (lines == null) {
            var millisecondsSpent = new Date() - startProcessing;
            $("#output-test").append("<strong>" + index + " lines are processed</strong> Milliseconds spent: " + millisecondsSpent + "<br />");
            return;
        }

        // output every line
        lines.forEach(function (line) {
            index++;
            //$("#output-test").append(index + ": " + line + "<br />");
        });
        
        reader.getNextBatch();
    });
    
    reader.getNextBatch();  
});

Code:

function FileLineStreamer() {   
    var loopholeReader = new FileReader();
    var chunkReader = new FileReader(); 
    var delimiter = "\n".charCodeAt(0); 
    
    var expectedChunkSize = 15000000; // Slice size to read
    var loopholeSize = 200;         // Slice size to search for line end

    var file = null;
    var fileSize;   
    var loopholeStart;
    var loopholeEnd;
    var chunkStart;
    var chunkEnd;
    var lines;
    var thisForClosure = this;
    var handler;
    
    // Reading of loophole ended
    loopholeReader.onloadend = function(evt) {
        // Read error
        if (evt.target.readyState != FileReader.DONE) {
            handler(null, new Error("Not able to read loophole (start: )"));
            return;
        }
        var view = new DataView(evt.target.result);
        
        var realLoopholeSize = loopholeEnd - loopholeStart;     
        
        for(var i = realLoopholeSize - 1; i >= 0; i--) {                    
            if (view.getInt8(i) == delimiter) {
                chunkEnd = loopholeStart + i + 1;
                var blob = file.slice(chunkStart, chunkEnd);
                chunkReader.readAsText(blob);
                return;
            }
        }
        
        // No delimiter found, looking in the next loophole
        loopholeStart = loopholeEnd;
        loopholeEnd = Math.min(loopholeStart + loopholeSize, fileSize);
        thisForClosure.getNextBatch();
    };
    
    // Reading of chunk ended
    chunkReader.onloadend = function(evt) {
        // Read error
        if (evt.target.readyState != FileReader.DONE) {
            handler(null, new Error("Not able to read chunk"));
            return;
        }
        
        lines = evt.target.result.split(/\r?\n/);       
        // Remove last new line in the end of chunk
        if (lines.length > 0 && lines[lines.length - 1] == "") {
            lines.pop();
        }
        
        chunkStart = chunkEnd;
        chunkEnd = Math.min(chunkStart + expectedChunkSize, fileSize);
        loopholeStart = Math.min(chunkEnd, fileSize);
        loopholeEnd = Math.min(loopholeStart + loopholeSize, fileSize);
                
        thisForClosure.getNextBatch();
    };
    
    this.getProgress = function () {
        if (file == null)
            return 0;
        if (chunkStart == fileSize)
            return 100;         
        return Math.round(100 * (chunkStart / fileSize));
    }

    // Public: open file for reading
    this.open = function (fileToOpen, linesProcessed) {
        file = fileToOpen;
        fileSize = file.size;
        loopholeStart = Math.min(expectedChunkSize, fileSize);
        loopholeEnd = Math.min(loopholeStart + loopholeSize, fileSize);
        chunkStart = 0;
        chunkEnd = 0;
        lines = null;
        handler = linesProcessed;
    };

    // Public: start getting new line async
    this.getNextBatch = function() {
        // File wasn't open
        if (file == null) {     
            handler(null, new Error("You must open a file first"));
            return;
        }
        // Some lines available
        if (lines != null) {
            var linesForClosure = lines;
            setTimeout(function() { handler(linesForClosure, null) }, 0);
            lines = null;
            return;
        }
        // End of File
        if (chunkStart == fileSize) {
            handler(null, null);
            return;
        }
        // File part bigger than expectedChunkSize is left
        if (loopholeStart < fileSize) {
            var blob = file.slice(loopholeStart, loopholeEnd);
            loopholeReader.readAsArrayBuffer(blob);
        }
        // All file can be read at once
        else {
            chunkEnd = fileSize;
            var blob = file.slice(chunkStart, fileSize);
            chunkReader.readAsText(blob);
        }
    };
};

I have written a module named line-reader-browser for the same purpose. It uses Promises.

Syntax (TypeScript):

import { LineReader } from "line-reader-browser"

// file is a JavaScript File object returned from an input element
// chunkSize (optional) is the number of bytes to read from the file at a time; defaults to 8 * 1024
const file: File
const chunkSize: number
const lr = new LineReader(file, chunkSize)

// context is optional. It can be used inside processLineFn
const context = {}
lr.forEachLine(processLineFn, context)
  .then((context) => console.log("Done!", context))

// context is same Object as passed while calling forEachLine
function processLineFn(line: string, index: number, context: any) {
   console.log(index, line)
}

Usage:

import { LineReader } from "line-reader-browser"

document.querySelector("input").onchange = () => {
   const input = document.querySelector("input")
   if (!input.files.length) return
   const lr = new LineReader(input.files[0], 4 * 1024)
   lr.forEachLine((line: string, i) => console.log(i, line)).then(() => console.log("Done!"))
}

Try the following code snippet to see the module working.

<html>
<head>
    <title>Testing line-reader-browser</title>
</head>
<body>
    <input type="file">
    <script src="https://cdn.rawgit.com/Vikasg7/line-reader-browser/master/dist/tests/bundle.js"></script>
</body>
</html>

Hope it saves someone's time!
