
Reading groups of lines from a large text file

I am looking to pull certain groups of lines from large (~870,000,000 line) text files. For example in a 50 line file I might want lines 3-6, 18-27, and 39-45.

From browsing Stack Overflow, I have found that the bash command:

tail -n +NUMstart file | head -n COUNT

is the fastest way to get a single line or group of lines starting at NUMstart and going to NUMend, where COUNT = NUMend - NUMstart + 1 (head takes a number of lines, not an end line number). However, when reading multiple groups of lines this seems inefficient, because the command has to be run once per group. Normally the technique wouldn't matter so much, but with files this large it makes a huge difference.

Is there a better way to go about this than using the above command for each group of lines? I am assuming the answer will most likely be a bash command but am really open to any language/tool that will do the job best.

To show lines 3-6, 18-27 and 39-45 with sed:

sed -n "3,6p;18,27p;39,45p" file

It is also possible to have sed read its commands from a script file.

Content of the script file foobar:

3,6p
18,27p
39,45p

Usage:

sed -n -f foobar file
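
On a file this large it may help to stop sed once the last wanted line has been printed, so the remaining lines are not read. The following is a minimal sketch, not part of the original answer: `print_ranges` is a hypothetical helper name, and it assumes at least one range is given, in increasing order, in START-END form.

```shell
# Hypothetical helper: print the given line ranges, then stop reading.
# Usage: print_ranges FILE START-END [START-END ...]  (increasing order)
print_ranges() {
    file=$1; shift
    script=''
    last=0
    for r in "$@"; do
        start=${r%-*}          # text before the dash
        end=${r#*-}            # text after the dash
        script="${script}${start},${end}p;"
        last=$end
    done
    # The trailing "q" quits at the last wanted line, e.g.
    # 3,6p;18,27p;39,45p;45q
    sed -n "${script}${last}q" "$file"
}

# Demo on a 50-line sample file.
sample=$(mktemp)
seq 50 > "$sample"
print_ranges "$sample" 3-6 18-27 39-45
rm -f "$sample"
```

The same effect can be had by hand by appending `45q` to the sed script shown above.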

awk to the rescue!

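 awk -v lines='3-6,18-27,39-45' '
       # Parse "a-b,c-d,..." into range starts rs[] and range ends re[].
       BEGIN {n=split(lines,a,",");
              for(i=1;i<=n;i++)
                {split(a[i],t,"-");
                 rs[++c]=t[1]; re[c]=t[2]}
              s=1}   # s = index of the first range not yet passed

             {for(i=s;i<=c;i++)
              if(NR>=rs[i] && NR<=re[i]) {print; next}   # inside a range
              else if(NR>re[i]) s++;                     # past this range
              if(s>c) exit}' file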

This exits early once the last requested line has been passed, so the rest of the file is not read. There is no error checking, and the ranges must be given in increasing order.

The problem with running tail -n XX file | head -n YY for different ranges is that you run it several times, and each run reads the file again from the start, hence the inefficiency. For a single range, benchmarks suggest that it is the best solution.

For this specific case, you may want to use awk:
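
To make the repeated work concrete, here is a sketch of the per-range approach. `slice` is a hypothetical helper name; note that head is given a line count (last - first + 1), not the end line number.

```shell
# Hypothetical helper: one tail|head pipeline per range.
# $1 = file, $2 = first line, $3 = last line
slice() {
    tail -n "+$2" "$1" | head -n "$(( $3 - $2 + 1 ))"
}

# Demo on a 50-line sample file: three ranges mean three separate
# passes, each starting again from the top of the file.
sample=$(mktemp)
seq 50 > "$sample"
for r in 3-6 18-27 39-45; do
    slice "$sample" "${r%-*}" "${r#*-}"
done
rm -f "$sample"
```

With a few small ranges near the top of the file this is cheap, but for ranges deep inside a very large file each call has to skip over everything before the range again.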

awk '(NR>=start1 && NR<=end1) || (NR>=start2 && NR<=end2) || ...' file

In your case:

awk '(NR>=3 && NR<=6) || (NR>=18 && NR<=27) || (NR>=39 && NR<=45)' file

That is, you group the ranges and let awk print the corresponding lines when they occur, looping through the file just once. It may also be useful to add a final NR==endX {exit} (endX being the closing line of the last range) so that awk stops processing once it has read the last interesting line.

In your case:

awk '(NR>=3 && NR<=6) || (NR>=18 && NR<=27) || (NR>=39 && NR<=45); NR==45 {exit}' file
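
When there are too many ranges to type into the condition, the same single-pass idea can read them from a separate file. This is a sketch under assumptions not in the original: `print_ranges_awk` is a hypothetical helper name, and the ranges file is assumed to hold one "START END" pair per line, with no blank lines, in increasing order.

```shell
# Hypothetical helper: $1 = ranges file ("START END" per line), $2 = data file
print_ranges_awk() {
    awk '
        NR == FNR { lo[++n] = $1; hi[n] = $2; next }   # first file: load ranges
        FNR > hi[n] { exit }                           # past the last range: stop
        {
            for (i = 1; i <= n; i++)
                if (FNR >= lo[i] && FNR <= hi[i]) { print; break }
        }
    ' "$1" "$2"
}

# Demo on a 50-line sample file.
ranges=$(mktemp)
sample=$(mktemp)
printf '%s\n' '3 6' '18 27' '39 45' > "$ranges"
seq 50 > "$sample"
print_ranges_awk "$ranges" "$sample"
rm -f "$ranges" "$sample"
```

The loop checks every range for every line, which is fine for a handful of ranges. For a long list, tracking the current range index as in the awk -v version above avoids the repeated scan.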
