Character substitution in file based on position extracted from another file
I have a list of files, each containing six lines. A representative example is shown below.
cat test.fa
>chain A
MIRLGAPQTLVLLTLLVAAVLRCQGQDVQEAGSCVQDGQRYNDKDVWKPEPCRICVCDTGTVLCDDIICEDVKDCLSPEIPFGECCPICPTDLATASGQPGPKGQKGEPGDIKDIVGPKGPPGPQGPAGEQGPRGDRGDKGEKGAPGPRGRDGEPGTPGNPGPPGPPGPPGPPGLGGNFAAQMAGGFDEKAGGAQLGVMQGPMGPMGPRGPPGPAGAPGPQGFQGNPGEPGEPGVSGPMGPRGPPGPPGKPGDDGEAGKPGKAGERGPPGPQGARGFPGTPGLPGVKGHRGYPGLDGAKGEAGAPGVKGESGSPGENGSPGPMGPRGLPGERGRTGPAGAAGARGNDGQPGPAGPPGPVGPAGGPGFPGAPGAKGE
>chain B
MMSFVQKGSWLLLALLHPTIILAQQEAVEGGCSHLGQSYADRDVWKPEPCQICVCDSGSVLCDDIICDDQELDCPNPEIPFGECCAVCPQPPTAPTRPPNGQGPQGPKGDPGPPGIPGRNGDPGIPGQPGSPGSPGPPGICESCPTGPQNYSPQYDSYDVKSGVAVGGLAGYPGPAGPPGPPGPPGTSGHPGSPGSPGYQGPPGEPGQAGPSGPPGPPGAIGPSGPAGKDGESGRPGRPGERGLPGPPGIKGPAGIPGFPGMKGHRGFDGRNGEKGETGAPGLKGENGLPGENGAPGPMGPRGAPGERGRPGLPGAAGARGNDGARGSDGQPGPPGPPGTAGFPGSPGAKGEVGPAGSPGSNGAPGQRGEPGPQGH
>chain C
MLPQIPFLLLVSLNLVHGVFYAERYQMPTGIKGPLPNTKTQFFIPYTIKSKGIAVRGEQGTPGPPGPAGPRGHPGPSGPPGKPGYGSPGLQGEPGLPGPPGPSAVGKPGVPGLPGKPGERGPYGPKGDVGPAGLPGPRGPPGPPGIPGPAGISVPGKPGQQGPTGAPGPRGFPGEKGAPGVPGMNGQKGEMGYGAPGRPGERGLPGPQGPTGPSGPPGVGKRGENGVPGQPGIKGDRGFPGEMGPIGPPGPQGPPGERGPEGIGKPGAAGAPGQPGIPGTKGLPGAPGIAGPPGPPGFGKPGLPGLKGERGPAGLPGGPGAKGEQGPAGLPGKPGLTGPPGNMGPQGPKGIPGSHGLPGPKGETGPAGPAGYPGAK
Reading another file called test.list row by row, I would like to substitute the character at position 140 in chain A of test.fa with "0" if the third column is "K", and the character at position 142 of chain B with "1" if the fourth column is "E". The same applies to the other rows.
cat test.list
A-B 140-142 K E
B-C 140-142 K E
A-B 299-301 K E
B-C 299-301 K E
I cannot figure out how to get a head start. Really appreciate any help!
Interpretation of the task
Files are to be processed in pairs, where a file with the extension .list contains the information needed to modify a corresponding file with the extension .fa.
The .fa files contain FASTA-formatted sequences for three chains identified as A, B, or C. I have assumed every .fa file will contain three sequences identified as A, B, and C (if other identifiers are used, the script will need to be modified to extract the identifiers for the chains).
The .list files contain four sets of conditions specifying changes to be made to the identified chains in the corresponding .fa file. Any changes accumulate as each condition is processed. Each condition is formatted as follows:
A-B 140-142 K E
Using the values in the above condition example (one of four per pair of files), the instructions become:
If position 140 in chain A (of the corresponding .fa file) is currently K
AND position 142 in chain B (of the corresponding .fa file) is currently E,
PROVIDED BOTH CONDITIONS ARE MET, change position 140 in chain A to 0
AND change position 142 in chain B to 1.
If only one (or no) condition is met, make no change. The changes will always be 0 for the first position and 1 for the second position.
The changed chains are then tested against the next three condition sets pertaining to that sequence file, accumulating changes with each condition set.
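The both-or-nothing rule for a single condition line can be sketched in isolation as follows. The helper name apply_rule, the short toy sequences, and the positions 3 and 5 are assumptions made for illustration; they stand in for the real chains and for positions such as 140 and 142.

```shell
# apply_rule is a hypothetical helper for illustration only.
# Arguments: chain1 chain2 pos1 pos2 residue1 residue2
apply_rule() {
    awk -v a="$1" -v b="$2" -v p="$3" -v q="$4" -v x="$5" -v y="$6" 'BEGIN {
        # change both chains only when both residues match
        if (substr(a, p, 1) == x && substr(b, q, 1) == y) {
            a = substr(a, 1, p - 1) "0" substr(a, p + 1)
            b = substr(b, 1, q - 1) "1" substr(b, q + 1)
        }
        print a, b
    }'
}

apply_rule MIKLG MMSFE 3 5 K E   # both tests pass: prints MI0LG MMSF1
apply_rule MIKLG MMSFQ 3 5 K E   # second test fails: prints MIKLG MMSFQ
```

In the second call chain A does have K at position 3, but it is left untouched because chain B does not have E at position 5.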
Approach
Since many paired files may exist, a bash script can be used to interrogate a directory containing target files to identify pairs of matched files to be processed. The target directory will be passed as an argument to the script.
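The pairing step on its own might look like the sketch below. The function name list_pairs and the throwaway demonstration files are assumptions for illustration; the sketch only finds pairs and does not modify anything.

```shell
# list_pairs is a hypothetical helper, not part of the original script.
# It prints "<fa file> <list file>" for every .fa file with a matching .list
# file, and reports unmatched .fa files on standard error.
list_pairs() {
    for fa in "$1"/*.fa; do
        [ -e "$fa" ] || continue        # the pattern matched no files
        list=${fa%.fa}.list             # swap only the trailing extension
        if [ -f "$list" ]; then
            printf '%s %s\n' "$fa" "$list"
        else
            printf 'unmatched: %s\n' "$fa" >&2
        fi
    done
}

# Demonstration on throwaway files.
demo=$(mktemp -d)
touch "$demo/sequence1.fa" "$demo/sequence1.list" "$demo/sequence2.fa"
pairs=$(list_pairs "$demo" 2>/dev/null)
echo "$pairs"
rm -rf "$demo"
```

Here sequence2.fa has no partner, so only the sequence1 pair is printed on standard output.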
The bash script will then pass pairs of matched files to an awk script, embedded in the bash script, where the values of the .list file can be read and the corresponding changes made to the sequences from the .fa file.
Because of the potential for unforeseen errors, the original .fa files are not overwritten. Instead, the substituted sequences will be written to a set of new files (in a sub-directory of the target directory).
In order to track the changes and report any unmatched files, the script will also create a log file to record the changes made to each file and note any unmatched files.
awk steps
The bash script passes the .list file to awk before the corresponding .fa sequence file. Thus, when the record number NR (the accumulated line count across both files, beginning with the .list file) is equal to the record number within the current file, FNR, the script is processing the first file, and an action block can be assigned to perform steps only on the data in the .list file. Using this, the conditions data (in the .list file) can be re-formatted into an array for easier reading later on:
# this block only executes for lines in the first argument file (the .list file) because it has the condition pattern NR==FNR;
NR==FNR{
split($1,abc,"-"); # splits e.g 'A-B' into an array named 'abc' with 'A' index 1, and 'B' in index 2;
split($2,nums,"-"); # splits e.g '140-142' into an array named 'nums' with '140' index 1, and '142' in index 2;
conditions[NR,1] = abc[1];
conditions[NR,2] = abc[2];
conditions[NR,3] = nums[1];
conditions[NR,4] = nums[2];
conditions[NR,5] = $3; # $3 is e.g. 'K';
conditions[NR,6] = $4; # $4 is e.g. 'E';
conditions[NR,7] = $0; # the whole line;
}
(The examples in the comments relate to the example condition A-B 140-142 K E used above.) A 2-d array named conditions is built, indexed by line number and field number, containing the data needed for the tests later on. (awk simulates multi-dimensional arrays by joining the subscripts into a single key.)
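To see the NR==FNR test and the conditions array at work, the following sketch runs a cut-down version of that block on two throwaway files. The file names demo.list and demo.fa, their contents, and the counter n are assumptions for illustration.

```shell
# Toy files created only for this demonstration.
tmp=$(mktemp -d)
printf 'A-B 140-142 K E\nB-C 299-301 K E\n' > "$tmp/demo.list"
printf '>chain A\nMIRLG\n' > "$tmp/demo.fa"

parsed=$(awk '
    NR==FNR {                       # true only while reading the first file
        split($1, abc, "-"); split($2, nums, "-")
        conditions[NR,1] = abc[1];  conditions[NR,2] = abc[2]
        conditions[NR,3] = nums[1]; conditions[NR,4] = nums[2]
        n = NR                      # number of condition lines seen so far
        next                        # skip any later rules for .list lines
    }
    END {
        for (i = 1; i <= n; i++)
            print i, conditions[i,1], conditions[i,3], conditions[i,2], conditions[i,4]
    }' "$tmp/demo.list" "$tmp/demo.fa")
echo "$parsed"
rm -rf "$tmp"
```

This prints "1 A 140 B 142" and "2 B 299 C 301". One caveat of the idiom: if the first file is empty, NR==FNR is also true for the lines of the second file, so an empty .list file would be misread.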
A set of three pairs of short blocks follows. These extract the sequences for the three chains A, B, and C, storing them in an associative array indexed with the relevant letter. The first of each pair identifies the record (line) number FNR of the line following the line in which the chain identifier is found, storing it in a variable named aLine, bLine or cLine for the three chains in each .fa file:
/>chain A/ {aLine = FNR+1}
FNR==aLine {chains["A"] = $0}
/>chain B/ {bLine = FNR+1}
FNR==bLine {chains["B"] = $0}
/>chain C/ {cLine = FNR+1}
FNR==cLine {chains["C"] = $0;}
When the line after the line containing the label, for example 'chain A', is encountered (FNR==aLine), the sequence found in that line is recorded in the chains array.
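If the chain identifiers were not known in advance, one possible variation (not what the script below does) is to take the identifier from the header line itself. The sketch assumes headers of the form ">chain X" followed by a single sequence line; the array name seq and the toy sequences are made up for illustration.

```shell
# Toy FASTA input with an extra, non-standard identifier "X".
chains=$(printf '>chain A\nMIRLG\n>chain B\nMMSFV\n>chain X\nMLPQI\n' | awk '
    /^>chain / {id = $2; next}            # header: remember the identifier
    id != ""   {seq[id] = $0; id = ""}    # next line: store the sequence
    END        {print seq["A"], seq["B"], seq["X"]}')
echo "$chains"
```

This prints "MIRLG MMSFV MLPQI". The fixed aLine, bLine and cLine blocks achieve the same result for the three known identifiers.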
The condition tests and substitutions are made entirely in the awk END block, which executes once after the earlier blocks have been applied to each record (line) of the argument files.
The END block contains a loop to cycle through each condition, making any specified changes before moving to the next condition.
The logic test is performed inside the loop using an if conditional, which requires that both conditions specified in the current condition line be met if the block of code is to be executed.
if (substr(chains[conditions[i,1]],conditions[i,3],1)==conditions[i,5] && substr(chains[conditions[i,2]],conditions[i,4],1)==conditions[i,6]) {
# executed only if both conditions are met;
}
The right-hand side of each comparison refers to the condition letter 'K' or 'E' from the earlier example, checking whether it is currently present at the position specified in the left-hand side of the comparison. The current character at that position is extracted from the sequence stored in chains[conditions[i,1]] (which resolves to a chains array element, e.g. chains["A"] or chains["B"]), using awk's substr function to extract the single character present at position conditions[i,3].
If both conditions are met, the substitutions are made by joining the substr results for the regions flanking the target position to either side of the required '0' or '1'. The sequences, altered or not, are then written to file.
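The flanking-substr technique can be wrapped in a small awk function, as in the sketch below. The function name replace_at and the toy sequence are assumptions for illustration; the script itself writes the substr calls inline.

```shell
accumulated=$(awk '
    # replace_at is a hypothetical helper: return s with the character at
    # 1-based position pos replaced by ch.
    function replace_at(s, pos, ch) {
        return substr(s, 1, pos - 1) ch substr(s, pos + 1)
    }
    BEGIN {
        seq = "MIKLGE"                 # toy sequence, not real data
        seq = replace_at(seq, 3, "0")  # first change
        seq = replace_at(seq, 6, "1")  # second change builds on the first
        print seq, length(seq)
    }')
echo "$accumulated"
```

This prints "MI0LG1 6": the length is unchanged, and the second replacement operates on the already modified string, which is how changes accumulate across condition sets.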
Implementing the script
The target files, one .list file for each .fa file, should be placed together in a directory. The directory name will be the argument for the script. The name part of the filenames should match for the matched pairs (e.g. sequence1.fa should have a matched conditions file named sequence1.list).
The script must be made executable. Assuming it is saved in a file named process_list_fa.sh, this can be done on the command line using:
chmod +x process_list_fa.sh
The script is then called from the command line as follows:
./process_list_fa.sh path/to/directory
The path/to/directory argument must be the directory in which the matched files are stored (don't use a trailing '/').
The script will execute, reporting progress to the command line. It may halt if, for example, the number of .fa files is different from the number of .list files, and await confirmation to continue.
Once complete, the target directory will contain a sub-directory named output. Inside the output directory will be a collection of files containing the modified sequences. Each will be named as it was originally but prefixed with modified_. A log.txt file is also created, summarising the changes made and any files skipped. This may be useful for checking the files, especially if large numbers of pairs are processed.
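As an additional spot check alongside log.txt, an original sequence line can be compared with its modified counterpart to list the positions that changed. The helper name changed_positions is an assumption for illustration, and the sketch assumes both strings have the same length.

```shell
# changed_positions is a hypothetical helper: print the 1-based positions at
# which two equal-length strings differ, separated by spaces.
changed_positions() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        out = ""
        for (i = 1; i <= length(a); i++)
            if (substr(a, i, 1) != substr(b, i, 1))
                out = out (out == "" ? "" : " ") i
        print out
    }'
}

changed_positions MIKLGE MI0LG1   # prints: 3 6
```

For real data, the two arguments would be the sequence line of a chain from the original .fa file and the same chain from the modified_ file.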
The process_list_fa.sh script
Copy the entire following snippet and save it in a file named process_list_fa.sh. Make the file executable (chmod +x process_list_fa.sh) and execute it, passing the directory containing the sequences as an argument (./process_list_fa.sh path/to/directory).
#!/bin/bash
# check for target directory passed as argument;
baseDir=$1;
if [ ! -d "$baseDir" ]; then
echo "target directory $baseDir not found. exiting";
exit 1; # non-zero status signals the failure to the caller;
fi
outputDir=$baseDir/output;
if [ ! -d "$outputDir" ]; then
mkdir "$outputDir"; # quoted so that paths containing spaces work;
fi
logFile=$outputDir/log.txt;
echo "directory: "$baseDir"/" > $logFile;
echo "results will be stored in: "$outputDir >> $logFile;
echo "checking for .fa and .list files inside $baseDir..."
skipCount=0;
skippedFiles=(); # array to store skipped file names in;
faCount=0;
listCount=0;
for FILE in "$baseDir"/*.fa; do
[ -e "$FILE" ] && ((faCount++)); # skip the unexpanded pattern when nothing matches;
done;
for FILE in "$baseDir"/*.list; do
[ -e "$FILE" ] && ((listCount++));
done;
echo $faCount ".fa files found" >> $logFile;
echo $listCount ".list files found" >> $logFile;
echo $faCount ".fa files found, and " $listCount ".list files found";
if [ $faCount == $listCount ]; then
echo "beginning replacement procedure...";
else
echo "WARNING - file numbers not equal";
echo "if you choose to continue, unmatched files will be skipped";
while true; do
read -p "Do you wish to continue? [y/n]" yn
case $yn in
[Yy]* ) echo "results follow:" >> $logFile;
echo "" >> $logFile;
break;;
[Nn]* ) echo "exiting... bye";
echo "process quit by user" >> $outputDir/log.txt;
exit;;
* ) echo "Please type y (continue) or n (exit).";;
esac
done;
fi
processCount=1;
ext=".list"
for FILE in $baseDir/*.fa; do
listFile=${FILE%.fa}$ext; # replace only the trailing .fa, not a ".fa" elsewhere in the path;
# does the .list file exist?
if [ ! -f "$listFile" ]; then
echo "Warning: no matched .list file for $FILE found. Skipping $FILE";
#echo "" >> $logFile;
#echo "No matched .list file for $FILE found. skipping $FILE" >> $logFile;
((skipCount++));
skippedFiles+=($FILE); # append to array;
continue;
fi
echo "file number "$processCount $FILE;
echo "matched list file: "$listFile "found";
echo "processing files "$FILE " and "$listFile;
echo "" >> $logFile;
echo "matched pair $processCount: " $FILE " and "$listFile >> $logFile;
((processCount++));
#### awk script ####
awk '
BEGIN {numConditionsSets=0;}
NR==FNR{
numConditionsSets++;
# organise conditions data into an array;
split($1,abc,"-");
split($2,nums,"-");
conditions[NR,1] = abc[1];
conditions[NR,2] = abc[2];
conditions[NR,3] = nums[1];
conditions[NR,4] = nums[2];
conditions[NR,5] = $3;
conditions[NR,6] = $4;
conditions[NR,7] = $0; # the whole line;
# make path to pre-existing log file;
if (NR==1) {
n = split(FILENAME, a, "/");
for (i=1; i<n; i++) {basename=basename a[i]"/";}
logFile = basename"output/log.txt";
} # end 1st line block;
} # end NR==FNR block;
# store chains in chains array;
/>chain A/ {aLine = FNR+1}
FNR==aLine {chains["A"] = $0}
/>chain B/ {bLine = FNR+1}
FNR==bLine {chains["B"] = $0}
/>chain C/ {cLine = FNR+1}
FNR==cLine {chains["C"] = $0;}
END {
# get .fa identified from file path;
n = split(FILENAME, a, "/");
fileID = a[n];
outPath = basename"output/"; # path to save output .fa files;
# loop conditions, mutating sequences along the way;
for (i=1; i<=numConditionsSets; i++) {
if (substr(chains[conditions[i,1]],conditions[i,3],1)==conditions[i,5] && substr(chains[conditions[i,2]],conditions[i,4],1)==conditions[i,6]) {
chains[conditions[i,1]] = substr(chains[conditions[i,1]],1,conditions[i,3]-1)"0"substr(chains[conditions[i,1]],conditions[i,3]+1);
chains[conditions[i,2]] = substr(chains[conditions[i,2]],1,conditions[i,4]-1)"1"substr(chains[conditions[i,2]],conditions[i,4]+1);
print "condition "i" ("conditions[i,7]"): chain "conditions[i,1]" residue "conditions[i,3]" "conditions[i,5]" to 0, chain "conditions[i,2]" residue "conditions[i,4]" "conditions[i,6]" to 1," >> logFile;
} else {print "condition "i" ("conditions[i,7]"): not met, no changes made," >> logFile;}
} # next i condition set;
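outFile = outPath"modified_"fileID;
# write chains in a fixed order (for-in order is unspecified in awk);
# ">" empties the file when it is first opened, so re-runs do not append to old output;
numChains = split("A B C", order, " ");
for (k=1; k<=numChains; k++) {
print ">chain "order[k]"\n"chains[order[k]] > outFile;
}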
} # end END block;
' "$listFile" "$FILE"
done; # end for-in-do;
#### log unmatched/skipped .fa files ######
echo "" >> $logFile;
echo "${#skippedFiles[@]} .fa files were unmatched with .list files and skipped:" >> $logFile;
for i in "${skippedFiles[@]}"
do
echo "$i" >> $logFile;
done
echo "" >> $logFile;
echo "END" >> $logFile;