
Reading groups of lines from a large text file

I am looking to pull certain groups of lines from large (~870,000,000 line) text files. For example, in a 50-line file I might want lines 3-6, 18-27, and 39-45.

From browsing Stack Overflow, I have found that the bash command:

tail -n+NUMstart file |head -nNUMend

is the fastest way to get a single line or group of lines starting at NUMstart and going to NUMend. However, when reading multiple groups of lines this seems inefficient. Normally the technique wouldn't matter so much, but with files this large it makes a huge difference.

Is there a better way to go about this than running the above command for each group of lines? I am assuming the answer will most likely be a bash command, but I am really open to any language/tool that will do the job best.
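For illustration, the repeated tail/head approach described above would look roughly like the following sketch (the head counts are simply end - start + 1 for each example range; "file" is a placeholder name):

tail -n +3 file | head -n 4     # lines 3-6
tail -n +18 file | head -n 10   # lines 18-27
tail -n +39 file | head -n 7    # lines 39-45

Each pipeline scans the file from the beginning up to the end of its range, so later ranges re-read everything before them, which is what makes this slow on a ~870,000,000-line file.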

To show lines 3-6, 18-27 and 39-45 with sed:

sed -n "3,6p;18,27p;39,45p" file

It is also possible to feed sed its commands from a file.

Content of file foobar:

3,6p
18,27p
39,45p

Usage:

sed -n -f foobar file
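As a possible refinement (not part of the original answer), sed can also be told to stop reading once the last requested line has been printed, similar to the early-exit awk solutions below, by appending a q (quit) command for the final line number:

sed -n "3,6p;18,27p;39,45p;45q" file

On a ~870,000,000-line file this avoids scanning everything after line 45.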

awk to the rescue!

awk -v lines='3-6,18-27,39-45' '
    BEGIN {n=split(lines,a,",");
           for(i=1;i<=n;i++)
             {split(a[i],t,"-");
              rs[++c]=t[1]; re[c]=t[2]}}
    {for(i=s;i<=c;i++)
       if(NR>=rs[i] && NR<=re[i]) {print; next}
       else if(NR>re[i]) s++;
     if(s>c) exit}' file

This provides an early exit after the last printed line. There is no error checking; the ranges should be provided in increasing order.
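Because the ranges are passed in through -v lines=..., the same program can be reused for other range sets without editing it. For example, assuming the program above has been saved to a file named ranges.awk (a hypothetical file name, not part of the original answer):

awk -v lines='100-250,1000-1010' -f ranges.awk file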

The problem with tail -n XX file | head -n YY for different ranges is that you are running it several times, hence the inefficiency. Otherwise, benchmarks suggest that tail and head are the best solution for a single range.

For this specific case, you may want to use awk:

awk '(NR>=start1 && NR<=end1) || (NR>=start2 && NR<=end2) || ...' file

In your case:

awk '(NR>=3 && NR<=6) || (NR>=18 && NR<=27) || (NR>=39 && NR<=45)' file

That is, you group the ranges and let awk print the corresponding lines when they occur, looping through the file just once. It may also be useful to add a final NR==endX {exit} (endX being the closing item of the last range) so that processing stops once the last interesting line has been read.

In your case:

awk '(NR>=3 && NR<=6) || (NR>=18 && NR<=27) || (NR>=39 && NR<=45); NR==45 {exit}' file
