
How can I search a large sorted file in Perl?

Can you suggest any CPAN modules for searching a large sorted file?

The file is structured data, about 15 to 20 million lines, but I only need to find about 25,000 matching entries, so I don't want to load the whole file into a hash.

Thanks.

Perl is well-suited to doing this, without the need for an external module (from CPAN or elsewhere).

Some code:

while (<STDIN>) {
    if (/regular expression/) {
        # process each matched line here, e.g. print it:
        print;
    }
}

You'll need to come up with your own regular expression to specify which lines you want to match in your file. Once a line matches, you need your own code to process it.

Put the above code in a script file and run it with your file redirected to stdin.
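As a more concrete sketch, assuming a hypothetical tab-separated key/value format with the wanted keys known up front, you could collect only the matching entries into a hash as you scan:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical format: each line is "key<TAB>value"; only the small
# set of keys in %wanted is kept, never the whole file.
my %wanted = map { $_ => 1 } qw(alpha gamma);

# An in-memory filehandle stands in for the real 20-million-line file.
my $data = "alpha\t1\nbeta\t2\ngamma\t3\n";
open my $fh, '<', \$data or die $!;

my %found;
while (my $line = <$fh>) {
    chomp $line;
    my ($key, $value) = split /\t/, $line, 2;
    $found{$key} = $value if $wanted{$key};
}

print "$_ => $found{$_}\n" for sort keys %found;
```

Only the ~25,000 matches end up in `%found`; the file itself is never held in memory.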

A scan over the whole file may be the fastest way. You can also try File::Sorted, which will do a binary search for a given record. Locating one record in a 25-million-line file should require about 15-20 seeks per record. This means that searching for 25,000 records would need only around 0.5 million seeks/comparisons, compared to 25,000,000 comparisons to naively examine each row.

Disk IO being what it is, you may want to try the easy way first, but File::Sorted is a theoretical win.
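File::Sorted's own interface isn't shown here, but the underlying binary search can be sketched by hand with seek and tell, assuming one record per line, Unix newlines, and a sort on a leading tab-separated key:

```perl
use strict;
use warnings;

# Binary search over byte offsets in a sorted file. After seeking into
# the middle of a line we discard the partial line and compare the next
# full one. Assumes Unix newlines and a tab-separated leading key.
sub find_record {
    my ($fh, $target) = @_;
    seek $fh, 0, 2;                     # find the file size
    my ($lo, $hi) = (0, tell $fh);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        seek $fh, $mid, 0;
        <$fh> if $mid > 0;              # skip the partial line we landed in
        my $pos  = tell $fh;
        my $line = <$fh>;
        if (!defined $line) { $hi = $mid; next }    # ran off the end
        chomp $line;
        my ($key) = split /\t/, $line, 2;
        if    ($key lt $target) { $lo = $pos + length($line) + 1 }
        elsif ($key gt $target) { $hi = $mid }
        else                    { return $line }
    }
    return undef;
}

# Small sorted sample standing in for the real file:
my $data = "ant\t1\nbee\t2\ncat\t3\n";
open my $fh, '<', \$data or die $!;
print find_record($fh, 'bee'), "\n";
```

Each probe seeks to the midpoint, resynchronizes on a line boundary, and halves the range, which is where the 15-20 seeks per record come from.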

You don't want to search the file, so do what you can to avoid it. We don't know much about your problem, but here are some tricks I've used on previous problems, all of which try to do the work ahead of time:

  • Break up the file into a database. That could even be SQLite.
  • Pre-index the file based on the data that you want to search.
  • Cache the results of previous searches.
  • Run common searches ahead of time, automatically.

All of these trade storage space for speed. Some of these I would set up as overnight jobs so they were ready for people when they came into work.
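The pre-indexing trick can be sketched with tell and seek: one pass records the byte offset of each record's key (a hypothetical tab-separated format again), and later lookups jump straight to the record. Only keys and offsets live in memory, not the records themselves:

```perl
use strict;
use warnings;

# One pass builds key => byte-offset; the index could also be saved to
# disk (e.g. with Storable) as one of those overnight jobs.
my $data = "alpha\t1\nbeta\t2\ngamma\t3\n";
open my $fh, '<', \$data or die $!;

my %offset;
my $pos = tell $fh;
while (my $line = <$fh>) {
    my ($key) = split /\t/, $line, 2;
    $offset{$key} = $pos;
    $pos = tell $fh;
}

# A later lookup seeks directly to the record:
seek $fh, $offset{beta}, 0;
my $record = <$fh>;
chomp $record;
print "$record\n";
```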

You mention that you have structured data, but don't say any more. Is each line a complete record? How often does this file change?

Sounds like you really want a database. Consider SQLite, using Perl's DBI and DBD::SQLite modules.
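A minimal sketch of that approach, assuming DBD::SQLite is installed from CPAN. The table and column names are made up; in real use the INSERT loop would read the big file once, and a database file on disk would replace :memory: so the load cost is paid only once:

```perl
use strict;
use warnings;
use DBI;    # DBD::SQLite must be installed from CPAN

my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1 });
$dbh->do('CREATE TABLE records (k TEXT PRIMARY KEY, v TEXT)');

# In real use, loop over the 20-million-line file here (inside one
# transaction) instead of these sample rows.
my $ins = $dbh->prepare('INSERT INTO records (k, v) VALUES (?, ?)');
$ins->execute(@$_) for ['alpha', '1'], ['beta', '2'];

# Each of the 25,000 lookups becomes an index probe on the primary key:
my ($v) = $dbh->selectrow_array(
    'SELECT v FROM records WHERE k = ?', undef, 'beta');
print "$v\n";
```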

When you process an input file with while ( <$filehandle> ), it only takes the file one line at a time (one line per iteration of the loop), so you don't need to worry about it clogging up your memory. Not so with a for loop, which slurps the whole file into memory. Use a regex or whatever else to find what you're looking for, and put that in a variable/array/hash or write it out to a new file.
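A small illustration of the difference (an in-memory handle stands in for a real file):

```perl
use strict;
use warnings;

my $data = "one\ntwo\nthree\n";
open my $fh, '<', \$data or die $!;

# while reads a single line per iteration, so memory use stays constant
# no matter how large the file is:
my $count = 0;
while (my $line = <$fh>) {
    $count++;
}
print "$count lines\n";

# By contrast, `for my $line (<$fh>)` evaluates <$fh> in list context,
# pulling every remaining line into memory before the loop starts.
```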
