
Reading through a file line by line without loading whole file into memory

I am working with a 50 GB MySQL export file, and performing a list of scripting operations on it to convert it to an SQLite3-loadable form (I got the leads from here: script to convert mysql dump sql file into format that can be imported into sqlite3 db). I have not studied the structure of the MySQL dump; the data was obtained from a third party. I can see that it has CREATE TABLE and INSERT INTO statements, but given the size it is hard to manually read through and understand the structure. Piping the whole file through at once will not work because of its size, and a bash script that loads the file and then processes it line by line, such as

while read line
do
    <do something>
done

complains that it is Out of Memory.

So I tried to pick out each line using awk or sed (both work), write the line to a file, and then pass it through the list of Perl scripts. This is the awk script I am using:

$ awk -vvar="$x" 'NR==var{print;exit}' file > temp

where x holds the line number; temp is then sent through the Perl commands and finally appended to the output file.

However, although fast initially, it quickly slows down, because each extraction has to iterate over an increasing number of lines from the start of the file. There are about 40,000 lines.

Has anyone worked with something like this? Is there a faster way of doing this?

Simply process one line at a time:

while read -r line
do
    echo "$line" > temp
    …process temp with Perl, etc…
done < file

At least this won't exhibit quadratic behaviour reading the file, which is what your awk script does. It reads the big file exactly once, which is optimal performance in Big-O notation (within a constant factor).
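
As a rough back-of-the-envelope estimate: extracting line k with the awk one-liner rescans lines 1 through k, so covering all 40,000 lines costs about 1 + 2 + … + 40,000 = 40,000 × 40,001 / 2 ≈ 800 million line reads, whereas a single pass reads each of the 40,000 lines exactly once.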

If, as you say, that causes problems in bash, then you should use Perl to read each line. With 40,000 lines in 50 GiB of data, you have about 1¼ MiB of data per line. That is unlikely to cause Perl any problems, though it might perhaps cause Bash problems. You can either revise the existing Perl to read one line at a time, or use a simple wrapper Perl script that does the job of the Bash script above.

wrapper.pl

Assuming your current processing script is called script.pl:

#!/usr/bin/env perl
use strict;
use warnings;

my $file = "temp";

while (<>)
{
    # Write the current input line to a scratch file...
    open my $fh, ">", $file or die "cannot open $file: $!";
    print $fh $_;
    close $fh;
    # ...then run the existing processing script on that one-line file.
    system "perl", "script.pl", $file;
}

Untested code

Invocation:

perl wrapper.pl <file >output
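
Alternatively, if you take the first option above and revise the existing Perl to read one line at a time itself, a minimal sketch of that shape could look like the following (transform_line is a hypothetical placeholder for whatever your current per-line conversion logic does):

#!/usr/bin/env perl
# line_by_line.pl -- sketch only; transform_line() is a hypothetical
# placeholder for the existing per-line MySQL-to-SQLite conversion logic.
use strict;
use warnings;

while (my $line = <>)    # reads one line at a time, never the whole file
{
    print transform_line($line);
}

sub transform_line
{
    my ($line) = @_;
    # ...adjust quoting, strip MySQL-specific syntax, etc...
    return $line;
}

Invoked the same way (perl line_by_line.pl <file >output), this avoids creating a temporary file and spawning a new perl process for each of the 40,000 lines.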
