
Print lines between line numbers from a line list and save each instance in a separate file using GNU Parallel

I have a file, say "Line_File", with a list of start & end line numbers and a file ID:

F_a 1 108
F_b 109 1210
F_c 131 1190

I have another file, "Data_File", from which I need to fetch all the lines between the line numbers listed in Line_File.

The command in sed:

sed -n '1,108p' Data_File > F_a.txt 

does the job, but I need to do this for all the values in columns 2 & 3 of Line_File and save the output under the file name given in column 1 of Line_File.

If $1, $2 and $3 are the three columns of Line_File, then I am looking for a command something like

sed -n '$2,$3p' Data_File > $1.txt

I could do the same using a Bash loop, but that will be very slow for a very large file, say 40GB.
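For reference, the straightforward serial version would be something like the loop below; it re-reads Data_File once per range, which is why it is slow:

while read -r name start end; do
  sed -n "${start},${end}p" Data_File > "${name}.txt"
done < Line_File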

I specifically want to do this because I am trying to use GNU Parallel to make it faster, and line-number-based slicing will make the output non-overlapping. I am trying to execute a command like this:

cat Data_File | parallel -j24 --pipe --block 1000M --cat LC_ALL=C sed -n '$2,$3p' > $1.txt

But I am not able to use the column values $1, $2 and $3 properly.

I tried the following command:

awk '{system("sed -n \""$2","$3"p\" Data_File > $1"NR)}' Line_File

But it doesn't work. Any idea where I am going wrong?

PS If my question is not clear then please point out what else I should be sharing.

You may use xargs with the -P (parallel) option:

xargs -P 8 -L 1 bash -c 'sed -n "$2,$3p" Data_File > $1.txt' _ < Line_File

Explanation:

  • This xargs command takes Line_File as input by using <
  • The -P 8 option allows it to run up to 8 processes in parallel
  • -L 1 makes xargs process one line of input at a time
  • bash -c ... forks a bash process for each input line
  • The _ before < is passed as $0, and the remaining 3 columns of each input line are passed as $1, $2 and $3 (see the small example after this list)
  • sed -n then runs for each line, using those positional parameters to form the command
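To illustrate how bash -c assigns those positional parameters, here is a small stand-alone example using the first row of Line_File (this demonstration is an addition, not part of the original answer):

bash -c 'echo "script name: $0, file: $1, start: $2, end: $3"' _ F_a 1 108
# prints: script name: _, file: F_a, start: 1, end: 108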

Or you may use GNU parallel like this:

parallel --colsep '[[:blank:]]' "sed -n '{2},{3}p' Data_File > {1}.txt" :::: Line_File

Check the parallel examples in the official documentation.

awk to the rescue!

This scans the data file only once:

$ awk 'NR==FNR {k=$1; s[k]=$2; e[k]=$3; next} 
               {for(k in s) if(FNR>=s[k] && FNR<=e[k]) print > (k".txt")}' lines data
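Assuming the files are named as in the question, the invocation would look something like the line below; with the sample ranges it writes F_a.txt, F_b.txt and F_c.txt in a single pass (note that the F_b and F_c ranges in the sample overlap, so the overlapping lines end up in both files):

awk 'NR==FNR {k=$1; s[k]=$2; e[k]=$3; next} 
             {for(k in s) if(FNR>=s[k] && FNR<=e[k]) print > (k".txt")}' Line_File Data_File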

This might work for you (GNU parallel and sed):

parallel --dry-run -a lineFile -C' ' "sed -n '{2},{3}p' dataFile > {1}"

This uses the column separator option -C ' ', set to a space, which maps the first 3 fields of the lineFile to {1}, {2} and {3}. The --dry-run option allows you to check the commands parallel generates before running them for real. Once the commands look correct, remove the --dry-run option.
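With the three sample rows from the question, the dry run should print commands along these lines (exact quoting may vary):

sed -n '1,108p' dataFile > F_a
sed -n '109,1210p' dataFile > F_b
sed -n '131,1190p' dataFile > F_c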

You are likely not to be CPU constrained. It is more likely your disks will be the limiting factor. To avoid reading DataFile over and over again, you should run as many jobs as possible in parallel. That way caching will help you:

cat Line_file |
  parallel -j0 --colsep ' ' sed -n {2},{3}p Data_File \> {1}.txt
