
Search for term in array and return array entry containing that term

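I'm making a tool that analyzes words and tries to determine when they were most commonly used. I'm using Google's Ngram dataset. In my code I'm streaming that data (about 2 GB). I turn the streamed data into an array, with each line of data as one entry. What I want to do is search the data for a certain word and store all array entries containing that word in a variable. So far I can find out whether the word is in the dataset and print that word (or its position in the dataset) to the console. I'm still learning to program, so please keep that in mind if my code is messy.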

// imports fs (filesystem) package duh
const fs = require('fs');

// the data stream
const stream = fs.createReadStream("/Users/user/Desktop/authortest_nodejs/testdata/testdata - p");

// gonna use this to keep track of whether ive found the search term or not
let found = false;

// this is the term the program looks for in the data
var search = "proceeded";

// lovely beautiful unclean way of turning my search term into regular expression
var searchThing = `\\b${search}`
var searchRegExp = new RegExp(searchThing, "g");

// starts streaming the test data file
stream.on('data', function(data) {
    // if found is false (my search term isn't found in this data chunk), set the found variable to true or false depending on whether it found anything
    if (!found) found = !!('' + data).match(searchRegExp);

    // turns raw data to a string and tries to find the location of the search term within it
    var dataLoc = data.toString().search(searchRegExp);
    var dataStr = data.toString().match(searchRegExp);

    // if the data search is null, continue streaming (gotta do this cuz if .match() turns up with no results it throws an error smh)
    if (!dataStr) return;

    // removes the null spots and line breaks, pretty up the displayed stuff
    var dataDisplay = dataStr.toString().replace("null", " ");
    var dataLocDisplay = dataLoc.toString().replace(/(\r\n|\n|\r)/gm,"");

    // turns each line of raw data into array
    var dataArray = data.toString().split("\n");

    // log found instances of search term (dunno why the hell id wanna do that, should fix to something useful) edit: commented it out cuz its too annoying
    //console.log(dataDisplay);

    // log location of word in string (there, more useful now?)
    console.log(dataDisplay);
});

// what happens when the stream thing returns an error
stream.on('error', function(err) {
    console.log(err, found);
});

// what happens when the stream thing finishes streaming
stream.on('close', function(err) {
    console.log(err, found, searchRegExp);
});

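Currently, this outputs every instance of the search term in the data (basically one word repeated a hundred times or so), but I need the output to be the entire line that contains the search term, not just the term itself ("proceeded 2006 5 3", not just "proceeded").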

As far as I understand, you're looking for something like this:

const fs = require('fs');

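function grep(path, word) {
    return new Promise((resolve, reject) => {
        let
            stream = fs.createReadStream(path, {encoding: 'utf8'}),
            buf = '',   // trailing partial line carried over from the previous chunk
            out = [],   // matching lines collected so far
            search = new RegExp(`\\b${word}\\b`, 'i');  // whole word, case-insensitive

        // keep the line if it contains the search word
        function process(line) {
            if (search.test(line))
                out.push(line);
        }

        stream.on('data', (data) => {
            let lines = data.split('\n');
            // a chunk can end mid-line: prepend the leftover from the last chunk
            // and hold back the (possibly incomplete) final piece for the next one
            lines[0] = buf + lines[0];
            buf = lines.pop();
            lines.forEach(process);
        });

        stream.on('end', () => {
            // whatever remains in buf is the last line of the file
            process(buf);
            resolve(out);
        });

        // reject instead of crashing on an unhandled stream error (e.g. missing file)
        stream.on('error', reject);
    });
}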

// works?
grep(__filename, 'stream').then(lines => console.log(lines))

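I think it's pretty straightforward. The buf stuff is needed to emulate line-by-line reading, because a chunk from the stream can end in the middle of a line (you could also use readline or a dedicated module for that).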
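For comparison, here is a rough sketch of the readline variant mentioned above. It is not part of the original answer: the name `grepLines` is made up for illustration, it assumes a Node.js version where the readline interface supports `for await`, and it leaves out error handling (a missing file would surface as an unhandled stream error).

```javascript
const fs = require('fs');
const readline = require('readline');

// Hypothetical helper (name chosen for this sketch): resolves with every
// line of the file at `path` that contains `word` as a whole word.
async function grepLines(path, word) {
    // same pattern as in grep(): whole word, case-insensitive
    const search = new RegExp(`\\b${word}\\b`, 'i');

    // readline splits the stream into lines, so no manual buffer is needed
    const rl = readline.createInterface({
        input: fs.createReadStream(path, {encoding: 'utf8'}),
        crlfDelay: Infinity, // treat \r\n as a single line break
    });

    const out = [];
    for await (const line of rl) {
        if (search.test(line))
            out.push(line);
    }
    return out;
}
```

It can be called the same way as `grep`, for example `grepLines(__filename, 'readline').then(lines => console.log(lines))`. The trade-off is mostly readability: readline takes care of chunk boundaries, while the manual `buf` approach avoids the extra module.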
