
Print lines between line numbers from a line list and save each instance in a separate file using GNU Parallel

I have a file, say "Line_File", with a list of start & end line numbers and a file ID:

F_a 1 108
F_b 109 1210
F_c 131 1190

I have another file, "Data_File", from which I need to fetch all the lines between the line numbers listed in Line_File.

The command in sed:

sed -n '1,108p' Data_File > F_a.txt 

does the job, but I need to do this for all the values in columns 2 & 3 of Line_File and save the output under the file name given in column 1 of Line_File.

If $1, $2 and $3 are the three columns of Line_File, then I am looking for a command something like

sed -n '$2,$3p' Data_File > $1.txt

I could do the same using a Bash loop, but that will be very slow for a very large file, say 40GB.
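For reference, the straightforward serial version would be something like the loop below; it re-reads Data_File once per range, which is why it is slow:

while read -r name start end; do
  sed -n "${start},${end}p" Data_File > "${name}.txt"
done < Line_File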

I specifically want to do this because I am trying to use GNU Parallel to make it faster, and line-number-based slicing will make the output non-overlapping. I am trying to execute a command like this:

cat Data_File | parallel -j24 --pipe --block 1000M --cat LC_ALL=C sed -n '$2,$3p' > $1.txt

But I am not able to use the column values $1, $2 and $3 properly.

I tried the following command:

awk '{system("sed -n \""$2","$3"p\" Data_File > $1"NR)}' Line_File

But it doesn't work. Any idea where I am going wrong?

PS If my question is not clear then please point out what else I should be sharing.

You may use xargs with the -P (parallel) option:

xargs -P 8 -L 1 bash -c 'sed -n "$2,$3p" Data_File > $1.txt' _ < Line_File

Explanation:

  • This xargs command takes Line_File as input by using <
  • The -P 8 option allows it to run up to 8 processes in parallel
  • -L 1 makes xargs process one line of input at a time
  • bash -c ... forks a bash process for each input line
  • The _ before < is passed as $0, and the remaining 3 columns of each input line are passed as $1, $2 and $3 (see the small example after this list)
  • sed -n then runs for each line, using those positional parameters to form the command
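To illustrate how bash -c assigns those positional parameters, here is a small stand-alone example using the first row of Line_File (this demonstration is an addition, not part of the original answer):

bash -c 'echo "script name: $0, file: $1, start: $2, end: $3"' _ F_a 1 108
# prints: script name: _, file: F_a, start: 1, end: 108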

Or you may use GNU parallel like this:

parallel --colsep '[[:blank:]]' "sed -n '{2},{3}p' Data_File > {1}.txt" :::: Line_File

Check the parallel examples in the official documentation.

awk to the rescue!

This scans the data file only once:

$ awk 'NR==FNR {k=$1; s[k]=$2; e[k]=$3; next} 
               {for(k in s) if(FNR>=s[k] && FNR<=e[k]) print > (k".txt")}' lines data
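Assuming the files are named as in the question, the invocation would look something like the line below; with the sample ranges it writes F_a.txt, F_b.txt and F_c.txt in a single pass (note that the F_b and F_c ranges in the sample overlap, so the overlapping lines end up in both files):

awk 'NR==FNR {k=$1; s[k]=$2; e[k]=$3; next} 
             {for(k in s) if(FNR>=s[k] && FNR<=e[k]) print > (k".txt")}' Line_File Data_File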

This might work for you (GNU parallel and sed):

parallel --dry-run -a lineFile -C' ' "sed -n '{2},{3}p' dataFile > {1}"

This uses the column separator option -C ' ', set to a space, which maps the first 3 fields of the lineFile to {1}, {2} and {3}. The --dry-run option allows you to check the commands parallel generates before running them for real. Once the commands look correct, remove the --dry-run option.
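With the three sample rows from the question, the dry run should print commands along these lines (exact quoting may vary):

sed -n '1,108p' dataFile > F_a
sed -n '109,1210p' dataFile > F_b
sed -n '131,1190p' dataFile > F_c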

You are likely not to be CPU constrained. It is more likely your disks will be the limiting factor. To avoid reading DataFile over and over again, you should run as many jobs as possible in parallel. That way caching will help you:

cat Line_file |
  parallel -j0 --colsep ' ' sed -n {2},{3}p Data_File \> {1}.txt
