
Character substitution in file based on position extracted from another file

I have a list of files, each containing six lines. A representative example is shown below.

cat test.fa
>chain A
MIRLGAPQTLVLLTLLVAAVLRCQGQDVQEAGSCVQDGQRYNDKDVWKPEPCRICVCDTGTVLCDDIICEDVKDCLSPEIPFGECCPICPTDLATASGQPGPKGQKGEPGDIKDIVGPKGPPGPQGPAGEQGPRGDRGDKGEKGAPGPRGRDGEPGTPGNPGPPGPPGPPGPPGLGGNFAAQMAGGFDEKAGGAQLGVMQGPMGPMGPRGPPGPAGAPGPQGFQGNPGEPGEPGVSGPMGPRGPPGPPGKPGDDGEAGKPGKAGERGPPGPQGARGFPGTPGLPGVKGHRGYPGLDGAKGEAGAPGVKGESGSPGENGSPGPMGPRGLPGERGRTGPAGAAGARGNDGQPGPAGPPGPVGPAGGPGFPGAPGAKGE
>chain B
MMSFVQKGSWLLLALLHPTIILAQQEAVEGGCSHLGQSYADRDVWKPEPCQICVCDSGSVLCDDIICDDQELDCPNPEIPFGECCAVCPQPPTAPTRPPNGQGPQGPKGDPGPPGIPGRNGDPGIPGQPGSPGSPGPPGICESCPTGPQNYSPQYDSYDVKSGVAVGGLAGYPGPAGPPGPPGPPGTSGHPGSPGSPGYQGPPGEPGQAGPSGPPGPPGAIGPSGPAGKDGESGRPGRPGERGLPGPPGIKGPAGIPGFPGMKGHRGFDGRNGEKGETGAPGLKGENGLPGENGAPGPMGPRGAPGERGRPGLPGAAGARGNDGARGSDGQPGPPGPPGTAGFPGSPGAKGEVGPAGSPGSNGAPGQRGEPGPQGH
>chain C
MLPQIPFLLLVSLNLVHGVFYAERYQMPTGIKGPLPNTKTQFFIPYTIKSKGIAVRGEQGTPGPPGPAGPRGHPGPSGPPGKPGYGSPGLQGEPGLPGPPGPSAVGKPGVPGLPGKPGERGPYGPKGDVGPAGLPGPRGPPGPPGIPGPAGISVPGKPGQQGPTGAPGPRGFPGEKGAPGVPGMNGQKGEMGYGAPGRPGERGLPGPQGPTGPSGPPGVGKRGENGVPGQPGIKGDRGFPGEMGPIGPPGPQGPPGERGPEGIGKPGAAGAPGQPGIPGTKGLPGAPGIAGPPGPPGFGKPGLPGLKGERGPAGLPGGPGAKGEQGPAGLPGKPGLTGPPGNMGPQGPKGIPGSHGLPGPKGETGPAGPAGYPGAK

Reading another file called test.list row by row, I would like to substitute character position 140 in chain A of test.fa with "0" if the third column is "K", and character position 142 of chain B with "1" if the fourth column is "E". The same applies to the other rows.

cat test.list

A-B 140-142 K E
B-C 140-142 K E
A-B 299-301 K E
B-C 299-301 K E

I cannot figure out how to get started. I would really appreciate any help!

Interpretation of the task

Files are to be processed in pairs, where a file with the extension .list contains the information needed to modify a corresponding file with the extension .fa.

The .fa files contain FASTA-formatted sequences for three chains identified as A, B, or C. I have assumed every .fa file will contain three sequences identified as A, B, and C (if other identifiers are used, the script will need to be modified to extract the chain identifiers).

The .list files contain four sets of conditions specifying changes to be made to the identified chains in the corresponding .fa file. Changes accumulate as each condition is processed. Each condition is formatted as follows:

A-B 140-142 K E

Using the values in the example condition above (one of four per pair of files), the instructions become:

If position 140 in chain A (of the corresponding .fa file) is currently K

AND position 142 in chain B (of the corresponding .fa file) is currently E,

THEN, provided both conditions are met, change position 140 in chain A to 0 AND change position 142 in chain B to 1.

If only one (or neither) condition is met, make no change. The changes will always be 0 for the first position and 1 for the second position.

The changed chains are then tested against the remaining three condition sets for that sequence file, accumulating changes with each condition set.
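As a worked illustration of these rules, here is a minimal sketch on toy data. The short sequences and the three condition lines are invented for the example and are not taken from the question. It applies the conditions in order and shows that a later condition sees the changes made by an earlier one.

```shell
result=$(awk '
BEGIN {
  # toy chains, invented for illustration only
  chains["A"] = "AKCD"; chains["B"] = "AEKD"; chains["C"] = "QEQQ"
  # toy condition lines in the same format as a .list file
  n = split("A-B 2-2 K E|B-C 3-2 K E|A-B 2-3 K K", conds, "|")
  for (i = 1; i <= n; i++) {
    split(conds[i], f, " ")     # f[1]=chain pair, f[2]=position pair, f[3], f[4]=letters
    split(f[1], id, "-")
    split(f[2], pos, "-")
    first = id[1]; second = id[2]
    # both tests must pass before either chain is changed
    if (substr(chains[first], pos[1], 1) == f[3] && substr(chains[second], pos[2], 1) == f[4]) {
      chains[first]  = substr(chains[first], 1, pos[1]-1) "0" substr(chains[first], pos[1]+1)
      chains[second] = substr(chains[second], 1, pos[2]-1) "1" substr(chains[second], pos[2]+1)
    }
  }
  print chains["A"], chains["B"], chains["C"]
}')
echo "$result"   # prints: A0CD A10D Q1QQ
```

The first two conditions are met and change the chains. The third is not met, because position 2 of chain A was already changed from K to 0 by the first condition.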

Approach

Since many paired files may exist, a bash script can be used to scan a directory containing the target files and identify pairs of matched files to be processed. The target directory is passed as an argument to the script.

The bash script then passes each pair of matched files to an awk script embedded in the bash script, where the values in the .list file are read and the corresponding changes made to the sequences from the .fa file.

Because of the potential for unforeseen errors, the original .fa files are not overwritten. Instead, the substituted sequences are written to a set of new files (in a sub-directory of the target directory).

To track the changes and report any unmatched files, the script also creates a log file that records the changes made to each file and notes any unmatched files.

awk steps

The bash script passes the .list file to awk before the corresponding .fa sequence file. Thus, when the record number NR (the line count accumulated across both files, beginning with the .list file) is equal to the record number within the current file, FNR, the script is processing the first file, and an action block can be assigned to perform steps only on the data in the .list file. Using this, the conditions data (in the .list file) can be re-formatted into an array for easier reading later on:

# this block only executes for lines in the first argument file (the .list file) because it has the condition pattern NR==FNR;
NR==FNR{
  split($1,abc,"-"); # splits e.g. 'A-B' into an array named 'abc' with 'A' at index 1 and 'B' at index 2;
  split($2,nums,"-"); # splits e.g. '140-142' into an array named 'nums' with '140' at index 1 and '142' at index 2;

conditions[NR,1] = abc[1];
conditions[NR,2] = abc[2];
conditions[NR,3] = nums[1];
conditions[NR,4] = nums[2];
conditions[NR,5] = $3; # $3 is e.g. 'K';
conditions[NR,6] = $4; # $4 is e.g. 'E';
conditions[NR,7] = $0; # the whole line;
}

(The examples in the comments relate to the example condition A-B 140-142 K E used above.) A 2-d array named conditions is built containing the data needed for the tests later on.

A set of three pairs of short blocks follows. These extract the sequences for the three chains A, B, and C, storing them in an associative array indexed by the relevant letter. The first of each pair identifies the record (line) number FNR of the line following the line in which the chain identifier is found, storing it in a variable named aLine, bLine or cLine for the three chains in each .fa file:
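To see the NR==FNR idiom in isolation, here is a small sketch using two throwaway files. The file names demo.list and demo.fa and their contents are made up for the demonstration. Unlike the block above, it adds a next statement so that the second rule is reached only for the second file.

```shell
tmp=$(mktemp -d)
printf 'A-B 140-142 K E\nB-C 140-142 K E\n' > "$tmp/demo.list"
printf '>chain A\nMIRL\n' > "$tmp/demo.fa"

out=$(awk '
NR==FNR { print "list line " FNR ": " $1; next }   # only true while reading the first file
        { print "fa line " FNR ": " $0 }           # reached only for the second file
' "$tmp/demo.list" "$tmp/demo.fa")
echo "$out"
rm -r "$tmp"
```

One caveat with this idiom: if the first file is empty, NR==FNR is also true while the second file is read, so an empty .list file would need separate handling.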

/>chain A/ {aLine = FNR+1}
FNR==aLine {chains["A"] = $0}
/>chain B/ {bLine = FNR+1}
FNR==bLine {chains["B"] = $0}
/>chain C/ {cLine = FNR+1}
FNR==cLine {chains["C"] = $0;}

When the line after the line containing the label, for example 'chain A', is encountered (FNR==aLine), the sequence found in that line is recorded in the chains array.

The condition tests and substitutions are made entirely in the awk END block, which executes once, after the earlier blocks have been applied to every record (line) of the argument files.

The END block contains a loop that cycles through each condition, making any specified changes before moving to the next condition.

The logic test is performed inside the loop using an if conditional, which requires that both conditions specified in the current condition line be met if the block of code is to be executed.
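If the chain identifiers are not always A, B, and C, or a sequence is wrapped over several lines, the identifier could instead be read from the header line itself. The following is an alternative sketch, not the approach used in the script: it assumes headers of the form ">chain ID", and the demo file and its contents are invented.

```shell
tmp=$(mktemp -d)
printf '>chain A\nMIRL\nGAPQ\n>chain X\nMMSF\n' > "$tmp/demo.fa"

seqs=$(awk '
/^>chain / { id = $2; order[++n] = id; next }   # header: remember the identifier after "chain"
id != ""   { chains[id] = chains[id] $0 }       # sequence line: append to the current chain
END { for (k = 1; k <= n; k++) print order[k] "=" chains[order[k]] }
' "$tmp/demo.fa")
echo "$seqs"
rm -r "$tmp"
```

Keeping the identifiers in the separate order array means the chains can later be written out in the order they appeared in the file.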

if (substr(chains[conditions[i,1]],conditions[i,3],1)==conditions[i,5] && substr(chains[conditions[i,2]],conditions[i,4],1)==conditions[i,6]) {
  # executed only if both conditions are met;
}

The right-hand side of each comparison is the condition letter ('K' or 'E' in the earlier example); the left-hand side extracts the character currently present at the specified position. The sequence is stored in chains[conditions[i,1]] (which resolves to a chains array element, e.g. chains["A"] or chains["B"]), and awk's substr function extracts the single character present at position conditions[i,3].

If both conditions are met, the substitutions are made by joining the substr results for the regions flanking the target position on either side of the required '0' or '1'. The sequences, altered or not, are then written to file.

Implementing the script

The target files, one .list file for each .fa file, should be placed together in a directory. The directory name will be the argument for the script. The name part of the filenames should match within each pair (e.g. sequence1.fa should have a matching conditions file named sequence1.list).
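The flanking-substr technique can be tried on its own with a short string. In this sketch the helper name replace_at and the test string are made up for illustration; awk positions are 1-based.

```shell
out=$(awk '
# replace the single character at 1-based position pos in string s with ch
function replace_at(s, pos, ch) {
  return substr(s, 1, pos - 1) ch substr(s, pos + 1)
}
BEGIN {
  print replace_at("GEKGA", 3, "0")   # middle character
  print replace_at("GEKGA", 1, "1")   # first character
  print replace_at("GEKGA", 5, "1")   # last character
}')
echo "$out"
```

This prints GE0GA, 1EKGA and GEKG1 on separate lines. A position beyond the end of the string would simply append the new character, so the replacement should only be performed after a test confirming that the expected letter is actually present at that position.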
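One way to derive the name of the matching .list file, and the name of the output file, from a .fa path in bash is suffix and prefix removal with parameter expansion. The path below is hypothetical.

```shell
fa="path/to/directory/sequence1.fa"     # hypothetical path
list="${fa%.fa}.list"                   # strip the trailing .fa, then append .list
name="${fa##*/}"                        # file name without the directory part
echo "$list"                            # path/to/directory/sequence1.list
echo "modified_$name"                   # modified_sequence1.fa
```

The % form only removes .fa from the end of the path, so a directory name that happens to contain ".fa" (for example data.fasta) is left untouched.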

(The following assumes the script is saved in a file named process_list_fa.sh.) The script must be made executable. This can be done on the command line using:

chmod +x process_list_fa.sh

The script is then called from the command line as follows:

./process_list_fa.sh path/to/directory

The path/to/directory argument must be the directory in which the matched files are stored (don't use a trailing '/').

The script will execute, reporting progress to the command line. It may halt, for example if the number of .fa files differs from the number of .list files, and await confirmation to continue.

Once complete, the target directory will contain a sub-directory named output. Inside the output directory will be a collection of files containing the modified sequences. Each is named as it was originally but prefixed with modified_. A log.txt file is also created, summarising the changes made and any files skipped. This may be useful for checking the files, especially if large numbers of pairs are processed.

The process_list_fa.sh script

Copy the entire following snippet and save it in a file named process_list_fa.sh. Make the file executable (chmod +x process_list_fa.sh) and execute it, passing the directory containing the sequences as an argument (./process_list_fa.sh path/to/directory).

#!/bin/bash

# check for target directory passed as argument;
baseDir=$1;
if [ ! -d "$baseDir" ]; then
  echo "target directory $baseDir not found. exiting";
  exit 1;
fi

outputDir=$baseDir/output;
if [ ! -d "$outputDir" ]; then
  mkdir $outputDir;
fi

logFile=$outputDir/log.txt;

echo "directory: "$baseDir"/" > $logFile;
echo "results will be stored in: "$outputDir >> $logFile;
echo "checking for .fa and .list files inside $baseDir..."
skipCount=0;
skippedFiles=(); # array to store skipped file names in;

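faCount=0;
listCount=0;

for FILE in "$baseDir"/*.fa; do
  [ -e "$FILE" ] || continue; # skip the unexpanded pattern when nothing matches;
  ((faCount++));
done;

for FILE in "$baseDir"/*.list; do
  [ -e "$FILE" ] || continue;
  ((listCount++));
done;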

echo $faCount ".fa files found" >> $logFile;
echo $listCount ".list files found" >> $logFile;

echo $faCount ".fa files found, and " $listCount ".list files found";

if [ $faCount == $listCount ]; then
  echo "beginning replacement procedure..."; 
else 
  echo "WARNING - file numbers not equal";
  echo "if you choose to continue, unmatched files will be skipped";

  while true; do
    read -p "Do you wish to continue? [y/n]" yn
    case $yn in
        [Yy]* ) echo "results follow:" >> $logFile;
                echo "" >> $logFile;
                break;;
        [Nn]* ) echo "exiting... bye";
                echo "process quit by user" >>  $outputDir/log.txt;
                exit;;
        * ) echo "Please type y (continue) or n (exit).";;
    esac
  done;
fi

processCount=1;
ext=".list"

for FILE in $baseDir/*.fa; do 
listFile=${FILE%.fa}$ext; # strip the trailing .fa and append .list;
# does the .list file exist?
if [ ! -f "$listFile" ]; then
  echo "Warning: no matched .list file for $FILE found. Skipping $FILE";
  #echo "" >> $logFile;
  #echo "No matched .list file for $FILE found. skipping $FILE" >> $logFile;
  ((skipCount++));
  skippedFiles+=($FILE); # append to array;
  continue;
fi

echo "file number "$processCount $FILE;
echo "matched list file: "$listFile "found";
echo "processing files "$FILE " and "$listFile;
echo "" >> $logFile;
echo "matched pair $processCount: " $FILE " and "$listFile >> $logFile;
((processCount++));

#### awk script ####

awk '
BEGIN {numConditionsSets=0;}

NR==FNR{
numConditionsSets++;

# organise conditions data into an array;
  split($1,abc,"-");
  split($2,nums,"-");

conditions[NR,1] = abc[1];
conditions[NR,2] = abc[2];
conditions[NR,3] = nums[1];
conditions[NR,4] = nums[2];
conditions[NR,5] = $3;
conditions[NR,6] = $4;
conditions[NR,7] = $0; # the whole line;


# make path to pre-existing log file;
  if (NR==1) {
    n = split(FILENAME, a, "/");
      for (i=1; i<n; i++) {basename=basename a[i]"/";}
    logFile = basename"output/log.txt";
  } # end 1st line block;
} # end NR==FNR block;


# store chains in chains array;
/>chain A/ {aLine = FNR+1}
FNR==aLine {chains["A"] = $0}
/>chain B/ {bLine = FNR+1}
FNR==bLine {chains["B"] = $0}
/>chain C/ {cLine = FNR+1}
FNR==cLine {chains["C"] = $0;}


END {

# get .fa identified from file path;
    n = split(FILENAME, a, "/");
    fileID = a[n];
    outPath = basename"output/"; # path to save output .fa files;

# loop conditions, mutating sequences along the way;

for (i=1; i<=numConditionsSets; i++) {

  if (substr(chains[conditions[i,1]],conditions[i,3],1)==conditions[i,5] && substr(chains[conditions[i,2]],conditions[i,4],1)==conditions[i,6]) { 

       chains[conditions[i,1]] = substr(chains[conditions[i,1]],1,conditions[i,3]-1)"0"substr(chains[conditions[i,1]],conditions[i,3]+1);
       chains[conditions[i,2]] = substr(chains[conditions[i,2]],1,conditions[i,4]-1)"1"substr(chains[conditions[i,2]],conditions[i,4]+1);
print "condition "i" ("conditions[i,7]"): chain "conditions[i,1]" residue "conditions[i,3]" "conditions[i,5]" to 0, chain "conditions[i,2]" residue "conditions[i,4]" "conditions[i,6]" to 1," >> logFile;

     } else {print "condition "i" ("conditions[i,7]"): not met, no changes made," >> logFile;}
} # next i condition set;


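  # write chains in a fixed order; ">" truncates any file left over from a previous run;
  nChains = split("A B C", order, " ");
  for (k=1; k<=nChains; k++) {
    print ">chain "order[k]"\n"chains[order[k]] > (outPath"modified_"fileID);
  }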
 

} # end END block;





' "$listFile" "$FILE"


done; # end for-in-do;

#### log unmatched/skipped .fa files ######
echo "" >> $logFile; 
echo "${#skippedFiles[@]} .fa files were unmatched with .list files and skipped:" >> $logFile;

for i in "${skippedFiles[@]}"
do
   echo "$i" >> $logFile;
done

echo "" >> $logFile;
echo "END" >> $logFile;
