
How can I match data from two large files in Perl?

I have two (large) files. The first one is about 200k lines, the second one about 30 million lines.

I want to check, using Perl, whether each line of the first file appears in the second. Is it faster to compare each line of the first directly against each line of the second, or is it better to store them all in two different arrays and then manipulate the arrays?

You have File A and File B. You want to check if lines in File A appear in File B.

If you have enough memory to hold the contents of File B in a hash using one entry per line, that's the simplest. Go ahead.
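
A minimal sketch of that approach, assuming File B fits in memory; the file names are placeholders:

use strict;
use warnings;

# Load File B into a hash, one entry per line.
open my $fhB, '<', 'fileB.txt' or die "fileB.txt: $!";
my %in_b;
while (<$fhB>) {
    chomp;
    $in_b{$_} = 1;
}
close $fhB;

# Stream File A and test each line with a hash lookup.
open my $fhA, '<', 'fileA.txt' or die "fileA.txt: $!";
while (<$fhA>) {
    chomp;
    print "match: $_\n" if exists $in_b{$_};
}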

However, if you do not, I recommend you put both files in tables in an SQL database. SQLite might be enough to start. Then, your problem is reduced to a simple JOIN. If line length is an issue, use a fast hash such as xxHash. If implemented correctly, the 64-bit version is blazing fast on a 64-bit machine, especially if you enabled optimizations in your Perl. Store two columns, the hash and the actual line. If the hashes match, check whether the lines match. Make sure to index on the hash column.
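
A sketch of that two-column scheme, assuming the DBI and DBD::SQLite modules plus a CPAN xxHash binding; Digest::xxHash and its xxhash64() are an assumption here, and any fast 64-bit hash would do. The file names are placeholders:

use strict;
use warnings;
use DBI;
use Digest::xxHash qw(xxhash64);   # assumed CPAN binding

my $dbh = DBI->connect('dbi:SQLite:dbname=lines.db', '', '',
                       { RaiseError => 1, AutoCommit => 0 });

# Two columns, as described: the 64-bit hash and the actual line.
$dbh->do('CREATE TABLE b (hash INTEGER, line TEXT)');

my $ins = $dbh->prepare('INSERT INTO b (hash, line) VALUES (?, ?)');
open my $fhB, '<', 'fileB.txt' or die "fileB.txt: $!";
while (<$fhB>) {
    chomp;
    $ins->execute(xxhash64($_, 0), $_);
}
$dbh->commit;

# Build the index after the bulk load; the hash column must be indexed.
$dbh->do('CREATE INDEX b_hash ON b (hash)');

# Probe with each File A line: look up by hash, then confirm the exact
# line, since distinct lines can collide on the hash.
my $sel = $dbh->prepare('SELECT line FROM b WHERE hash = ?');
open my $fhA, '<', 'fileA.txt' or die "fileA.txt: $!";
while (<$fhA>) {
    chomp;
    $sel->execute(xxhash64($_, 0));
    while (my ($line) = $sel->fetchrow_array) {
        print "match: $_\n" if $line eq $_;
    }
}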

You say:

In fact, my files are like: File A: name number (per line); File B: name date location number (per line). And I have to check if File B contains lines matching the data of File A (ignoring date and location, for example). So it's not an exact match ...

In that case, you are set. You do not even have to worry about the hash stuff (which I am leaving here for reference). Put the interesting bits of data you need to match on in separate columns in an SQLite database. Write a join. ... Profit.
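
A sketch of that variant, again with DBI/DBD::SQLite, splitting each line into whitespace-separated fields per the layout quoted above; the file names are placeholders:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=match.db', '', '',
                       { RaiseError => 1, AutoCommit => 0 });

# Only the columns we match on need to be stored.
$dbh->do('CREATE TABLE a (name TEXT, number TEXT)');
$dbh->do('CREATE TABLE b (name TEXT, number TEXT)');

my $ins_a = $dbh->prepare('INSERT INTO a VALUES (?, ?)');
open my $fhA, '<', 'fileA.txt' or die "fileA.txt: $!";
while (<$fhA>) {
    my ($name, $number) = split;
    $ins_a->execute($name, $number);
}

my $ins_b = $dbh->prepare('INSERT INTO b VALUES (?, ?)');
open my $fhB, '<', 'fileB.txt' or die "fileB.txt: $!";
while (<$fhB>) {
    my ($name, $date, $location, $number) = split;   # date and location are ignored
    $ins_b->execute($name, $number);
}
$dbh->commit;

$dbh->do('CREATE INDEX b_key ON b (name, number)');

# The whole problem is now one join.
my $rows = $dbh->selectall_arrayref(
    'SELECT DISTINCT a.name, a.number
       FROM a JOIN b ON a.name = b.name AND a.number = b.number');
print "@$_\n" for @$rows;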

Alternatively, you could use BerkeleyDB, which gives you the conceptual simplicity of an in-memory hash while storing the table on disk. If you have multiple attributes on which to match, this will not scale well.
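
A sketch of the BerkeleyDB route, using the CPAN BerkeleyDB module's tie interface so the hash lives on disk rather than in RAM; the file names are placeholders:

use strict;
use warnings;
use BerkeleyDB;

# Tie a Perl hash to an on-disk Berkeley DB hash table.
tie my %in_b, 'BerkeleyDB::Hash',
    -Filename => 'fileB.db',
    -Flags    => DB_CREATE
    or die "cannot open fileB.db: $BerkeleyDB::Error";

open my $fhB, '<', 'fileB.txt' or die "fileB.txt: $!";
while (<$fhB>) {
    chomp;
    $in_b{$_} = 1;   # written to disk, not held in memory
}

open my $fhA, '<', 'fileA.txt' or die "fileA.txt: $!";
while (<$fhA>) {
    chomp;
    print "match: $_\n" if exists $in_b{$_};
}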

Store the first file's lines in a hash, then iterate through the second file without storing it in memory.

It might seem counterintuitive to store the first file and iterate over the second rather than vice versa, but it avoids creating a 30-million-element hash.

use strict;
use warnings;
use feature 'say';

my ($path_1, $path_2) = @ARGV;

# Store each line of the first (smaller) file, keyed by the line itself,
# with its line number as the value.
open my $fh1, '<', $path_1 or die "$path_1: $!";
my %f1;
$f1{$_} = $. while <$fh1>;
close $fh1;

# Stream the second file; a single hash lookup per line finds the matches.
open my $fh2, '<', $path_2 or die "$path_2: $!";
while (<$fh2>) {
    if (my $f1_line = $f1{$_}) {
        say "file 1 line $f1_line appears in file 2 line $.";
    }
}

Note that without further processing, the matched lines will display in the order they appear in the second file, not the first.

Also, this assumes file 1 does not have duplicate lines, but that can be handled if necessary.
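
For example, one way to handle duplicates in file 1 is a small variation on the script above: store every line number under each line instead of only the last one.

use strict;
use warnings;
use feature 'say';

my ($path_1, $path_2) = @ARGV;

open my $fh1, '<', $path_1 or die "$path_1: $!";
my %f1;
push @{ $f1{$_} }, $. while <$fh1>;   # keep all line numbers, not just the last

open my $fh2, '<', $path_2 or die "$path_2: $!";
while (<$fh2>) {
    if (my $lines = $f1{$_}) {
        say "file 1 line(s) @$lines appear in file 2 line $.";
    }
}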
