I am looking to pull certain groups of lines from large (~870,000,000-line) text files. For example, in a 50-line file I might want lines 3-6, 18-27, and 39-45.
From browsing Stack Overflow, I have found that the bash command:
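tail -n +NUMstart file | head -n COUNT
is the fastest way to get a single line or group of lines starting at NUMstart and going to NUMend, where COUNT is NUMend - NUMstart + 1 (head takes a number of lines, not an ending line number). However, when reading multiple groups of lines this seems inefficient, since the file is read again from the start for each group. Normally the technique wouldn't matter so much, but with files this large it makes a huge difference.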
Is there a better way to go about this than using the above command for each group of lines? I am assuming the answer will most likely be a bash command but am really open to any language/tool that will do the job best.
To show lines 3-6, 18-27 and 39-45 with sed:
sed -n "3,6p;18,27p;39,45p" file
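As written, sed keeps reading to the end of the file even after line 45, which matters on an input of this size. A minimal sketch that adds a quit command at the last wanted line (the function name print_ranges_sed is hypothetical and used only for illustration):

```shell
# Print lines 3-6, 18-27 and 39-45, then quit at line 45 so that the
# rest of the file is never read.
# print_ranges_sed is a hypothetical helper name, not part of sed.
print_ranges_sed() {
    sed -n '3,6p;18,27p;39,45p;45q' "$1"
}

# Usage: print_ranges_sed file
```

The 45q must come after 39,45p so that line 45 is printed before sed exits.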
It is also possible to have sed read its commands from a script file.
Content of file foobar (one command per line):
3,6p
18,27p
39,45p
Usage:
sed -n -f foobar file
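With many ranges, the script file can be generated rather than typed by hand. The sketch below assumes the ranges arrive as a comma-separated string in increasing order; ranges_to_sed is a hypothetical helper name, and the trailing quit command is an addition that is not in the script file shown above.

```shell
# ranges_to_sed is a hypothetical helper: it turns "3-6,18-27,39-45"
# into sed commands, one per line, and appends a quit command at the
# end of the last range (assumes the ranges are in increasing order).
ranges_to_sed() {
    printf '%s\n' "$1" | tr ',' '\n' | awk -F '-' '
        { print $1 "," $2 "p"; last = $2 }
        END { if (last != "") print last "q" }'
}

# Usage (foobar is the script file name used above):
#   ranges_to_sed '3-6,18-27,39-45' > foobar
#   sed -n -f foobar file
```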
awk to the rescue!
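awk -v lines='3-6,18-27,39-45' '
BEGIN {
    # split "3-6,18-27,39-45" into range starts rs[] and range ends re[]
    n = split(lines, a, ",")
    for (i = 1; i <= n; i++) {
        split(a[i], t, "-")
        rs[++c] = t[1]; re[c] = t[2]
    }
    s = 1   # index of the first range that has not been passed yet
}
{
    for (i = s; i <= c; i++)
        if (NR >= rs[i] && NR <= re[i]) { print; next }
        else if (NR > re[i]) s++
    if (s > c) exit   # every range has been passed
}' file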
This exits early, as soon as it reads the line after the last printed one. There is no error checking, and the ranges must be given in increasing order.
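Since the script does no error checking, the range list could be validated before the large file is touched. The following is only a sketch under the assumption that ranges are written as start-end, separated by commas, and must be strictly increasing and non-overlapping; check_ranges is a hypothetical name.

```shell
# check_ranges is a hypothetical helper: it succeeds (exit status 0) only
# if every item looks like START-END with START <= END and each range
# starts after the previous one ended.
check_ranges() {
    printf '%s\n' "$1" | tr ',' '\n' | awk -F '-' '
        NF != 2 || $1 !~ /^[0-9]+$/ || $2 !~ /^[0-9]+$/ { bad = 1 }
        $1 + 0 > $2 + 0 || $1 + 0 <= prev               { bad = 1 }
        { prev = $2 + 0 }
        END { exit (bad ? 1 : 0) }'
}

# Usage:
#   check_ranges '3-6,18-27,39-45' || echo 'bad range list' >&2
```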
The problem with tail -n +XX file | head -n YY for different ranges is that you are running it several times, so the file is scanned again for each range; hence the inefficiency. For a single range, benchmarks suggest that it is the best solution.
For this specific case, you may want to use awk:
awk '(NR>=start1 && NR<=end1) || (NR>=start2 && NR<=end2) || ...' file
In your case:
awk '(NR>=3 && NR<=6) || (NR>=18 && NR<=27) || (NR>=39 && NR<=45)' file
That is, you group the ranges and let awk print the corresponding lines when they occur, looping through the file just once. It may also be useful to add a final NR==endX {exit} (endX being the closing item of the last range) so that it stops processing once it has read the last interesting line.
In your case:
awk '(NR>=3 && NR<=6) || (NR>=18 && NR<=27) || (NR>=39 && NR<=45); NR==45 {exit}' file
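For a longer list of ranges, the condition can be built from a string instead of being typed out. This is a rough sketch, assuming the ranges are given as a comma-separated start-end list in increasing order; print_ranges_awk is a hypothetical name. Because the string is pasted into the awk program, it should only be used with trusted, numeric input.

```shell
# print_ranges_awk is a hypothetical helper.
#   $1 = ranges such as "3-6,18-27,39-45" (increasing order assumed)
#   $2 = file to read
print_ranges_awk() {
    # Build "(NR>=3 && NR<=6) || ... ; NR==45 {exit}" from the range list.
    prog=$(printf '%s\n' "$1" | tr ',' '\n' | awk -F '-' '
        { cond = cond (NR > 1 ? " || " : "") "(NR>=" $1 " && NR<=" $2 ")"; last = $2 }
        END { print cond "; NR==" last " {exit}" }')
    awk "$prog" "$2"
}

# Usage: print_ranges_awk '3-6,18-27,39-45' file
```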