I have a file, say "Line_File", with a list of line start and end numbers and a file ID:
F_a 1 108
F_b 109 1210
F_c 131 1190
I have another file, "Data_File", from which I need to fetch all the lines between the line numbers listed in Line_File.
The following sed command:
sed -n '1,108p' Data_File > F_a.txt
does the job, but I need to do this for all the values in columns 2 and 3 of Line_File and save the output with the file name given in column 1 of Line_File.
If $1, $2 and $3 are the three columns of Line_File, then I am looking for a command something like:
sed -n '$2,$3p' Data_File > $1.txt
I can run the same using a Bash loop, but that will be very slow for a very large file, say 40 GB.
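For reference, the slow serial loop would look roughly like this (a sketch using tiny stand-in files; the real Line_File and Data_File would be used instead, and the file contents here are made up for illustration):

```shell
# Tiny stand-in files so the sketch is self-contained.
printf '%s\n' one two three four five > Data_File
printf 'F_a 1 2\nF_b 3 5\n' > Line_File

# Serial loop: re-reads Data_File once per row of Line_File.
while read -r name start end; do
  sed -n "${start},${end}p" Data_File > "${name}.txt"
done < Line_File
# F_a.txt now holds lines 1-2 of Data_File, F_b.txt holds lines 3-5.
```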
I specifically want to do this because I am trying to use GNU Parallel to make it faster, and line-number-based slicing will keep the outputs non-overlapping. I am trying to execute a command like this:
cat Data_File | parallel -j24 --pipe --block 1000M --cat LC_ALL=C sed -n '$2,$3p' > $1.txt
But I am not able to use the column values $1, $2 and $3 properly.
I tried the following command:
awk '{system("sed -n \""$2","$3"p\" Data_File > $1"NR)}' Line_File
But it doesn't work. Any idea where I am going wrong?
PS If my question is not clear then please point out what else I should be sharing.
You may use xargs with the -P (parallel) option:
xargs -P 8 -L 1 bash -c 'sed -n "$2,$3p" Data_File > $1.txt' _ < Line_File
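To see how the columns land in $1, $2 and $3, note that bash -c assigns the word after the script to $0 and the rest to the positional parameters, which is why the _ placeholder is needed (a quick standalone check, not part of the answer's command):

```shell
# The '_' fills $0; the three column values then become $1, $2, $3.
bash -c 'echo "$0 $1 $2 $3"' _ F_a 1 108
# prints: _ F_a 1 108
```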
Explanation:
- xargs takes Line_File as input via the < redirection
- -P 8 allows it to run up to 8 processes in parallel
- -L 1 makes xargs process one line at a time
- bash -c ... forks bash for each line of the input file
- the _ before < is passed as $0, and the remaining 3 columns of each input line are passed as $1, $2 and $3
- sed -n runs a sed command for each line by forming a command line

Or you may use GNU parallel like this:
parallel --colsep '[[:blank:]]' "sed -n '{2},{3}p' Data_File > {1}.txt" :::: Line_File
awk to the rescue! This scans the data file only once:
$ awk 'NR==FNR {k=$1; s[k]=$2; e[k]=$3; next}
{for(k in s) if(FNR>=s[k] && FNR<=e[k]) print > (k".txt")}' lines data
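As a quick sanity check of the single-pass awk above (tiny stand-in files with made-up contents, matching the lines/data names used in the answer):

```shell
# Two ranges that overlap on line 2 of the data file.
printf 'F_a 1 2\nF_b 2 3\n' > lines
printf '%s\n' aa bb cc > data

# Same program as above: first pass stores the ranges keyed by name,
# second pass writes each data line into every file whose range covers it.
awk 'NR==FNR {k=$1; s[k]=$2; e[k]=$3; next}
     {for(k in s) if(FNR>=s[k] && FNR<=e[k]) print > (k".txt")}' lines data
# F_a.txt gets data lines 1-2, F_b.txt gets data lines 2-3.
```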
This might work for you (GNU parallel and sed):
parallel --dry-run -a lineFile -C' ' "sed -n '{2},{3}p' dataFile > {1}"
This uses the column separator option -C ' ' set to a space, which maps the first three fields of lineFile to {1}, {2} and {3}. The --dry-run option allows you to check the commands parallel generates before running for real. Once the commands look correct, remove the --dry-run option.
You are likely not to be CPU constrained. It is more likely your disks will be the limiting factor. To avoid reading DataFile over and over again, you should run as many jobs as possible in parallel. That way caching will help you:
cat Line_file |
parallel -j0 --colsep ' ' sed -n {2},{3}p Data_File \> {1}.txt