Pattern match and replace the string with if else loop

I have a file containing multiple lines starting with "1ECLI H--- 12.345 .....". I want to remove one space between the I and the H and append R/S/T as the H pattern repeats. For example, if H810 is repeated on three consecutive lines, it should get the letter R on the first line, S on the second, and T on the third, so the first becomes H810R. Any help will be appreciated.
The text looks like this:

1ECLI  H813   98   7.529   8.326   9.267
1ECLI  H813   99   7.427   8.470   9.251
1ECLI  C814  100   7.621   8.513   9.263
1ECLI  H814  101   7.607   8.617   9.289
1ECLI  H814  102   7.633   8.489   9.156
1ECLI  H814  103   7.721   8.509   9.305
1ECLI   C74  104   8.164   8.733  10.740
1ECLI  H74R  105   8.247   8.690  10.799

After the change:

1ECLI H813R   98   7.529   8.326   9.267
1ECLI H813S   99   7.427   8.470   9.251
1ECLI  C814  100   7.621   8.513   9.263
1ECLI H814R  101   7.607   8.617   9.289
1ECLI H814S  102   7.633   8.489   9.156
1ECLI H814T  103   7.721   8.509   9.305
1ECLI   C74  104   8.164   8.733  10.740
1ECLI  H74R  105   8.247   8.690  10.799

Thanks.

If your Input_file is the same as the sample shown, could you please try the following awk and let me know if it helps.

awk '
BEGIN{
  val[1]="R";
  val[2]="S";
  val[3]="T"
}
$2 !~ /^H[0-9]+/ || i==3{
  i=""
}
$2 ~ /^H[0-9]+$/ && /^1ECLI/{
  $2=$2val[++i]
}
1
'   Input_file  > temp_file  && mv  temp_file   Input_file

Adding an explanation of the answer as follows.

awk '
BEGIN{                        ##BEGIN section: runs once, before any input is read.
  val[1]="R";                 ##Array val, index 1, holds the string R.
  val[2]="S";                 ##Index 2 holds S.
  val[3]="T"                  ##Index 3 holds T.
}
$2 !~ /^H[0-9]+/ || i==3{     ##If the 2nd field does not start with H followed by digits, OR i has reached 3,
  i=""                        ##reset the counter i.
}
$2 ~ /^H[0-9]+$/ && /^1ECLI/{ ##If the 2nd field is exactly H followed by digits AND the line starts with 1ECLI,
  $2=$2val[++i]               ##append val[++i] to the 2nd field (i is incremented first).
}
1                             ##A true condition with no action prints the current line.
' Input_file   > temp_file  && mv  temp_file   Input_file                 ##Write to temp_file, then replace Input_file with it.

The one below can also give the desired output, if your real input file is the same as what you posted.

awk 'BEGIN{split("R,S,T",a,/,/)}f=$2~/^H[0-9]+$/{$2 = $2 a[++c]}!f{c=0}1' infile 

Explanation

  • split("R,S,T",a,/,/) - splits the string "R,S,T" on the comma separator and saves the pieces in array a, so that a[1] = R, a[2] = S, a[3] = T.

  • f=$2~/^H[0-9]+$/ - f is a variable that stores the result of the regexp test $2 ~ /^H[0-9]+$/. The test returns a boolean: f is true (1) if the second field matches, otherwise false (0).

  • $2 = $2 a[++c] - if the test above was true, modify the second field so that it holds its existing value plus the element of array a at index c. ++c is a pre-increment, so c is incremented before it is used as the index.

  • !f{c=0} - if f is false, reset the counter c, because the run of consecutive matches has ended.

  • 1 at the end performs the default action, which is to print the current record (print $0). To see how this works, try awk '1' infile, which prints all records/lines, whereas awk '0' infile prints nothing. Any number other than zero is true, which triggers the default action.
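The steps in the explanation above can also be written out as a short Python sketch, which may make the flag-and-counter logic easier to follow. This is only an illustration, not part of the original answer: the names add_suffixes, SUFFIXES, H_FIELD and sample are made up here, and the input lines are assumed to carry no trailing newline.

```python
import re

SUFFIXES = ["R", "S", "T"]              # same role as array a in the awk one-liner
H_FIELD = re.compile(r"^H[0-9]+$")      # same test as $2 ~ /^H[0-9]+$/


def add_suffixes(lines):
    """Append R/S/T to the 2nd field on consecutive lines where it is H<digits>."""
    out = []
    count = 0                           # plays the role of awk's c
    for line in lines:
        fields = line.split()
        matched = len(fields) > 1 and H_FIELD.match(fields[1]) is not None
        if matched:
            # awk's a[++c] is an empty string once c goes past 3; do the same here
            suffix = SUFFIXES[count] if count < len(SUFFIXES) else ""
            count += 1
            fields[1] += suffix
            # like awk after assigning to $2, the line is rebuilt with single spaces
            out.append(" ".join(fields))
        else:
            count = 0                   # run of matches ended, like !f{c=0}
            out.append(line)            # untouched lines keep their spacing
    return out


sample = [
    "1ECLI  H813   98   7.529   8.326   9.267",
    "1ECLI  H813   99   7.427   8.470   9.251",
    "1ECLI  C814  100   7.621   8.513   9.263",
    "1ECLI  H814  101   7.607   8.617   9.289",
]
for row in add_suffixes(sample):
    print(row)
```

As with the awk version, only the modified lines lose their original column spacing, because only those lines are rebuilt from their fields.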

Test Results:

$ cat infile
1ECLI  H813   98   7.529   8.326   9.267
1ECLI  H813   99   7.427   8.470   9.251
1ECLI  C814  100   7.621   8.513   9.263
1ECLI  H814  101   7.607   8.617   9.289
1ECLI  H814  102   7.633   8.489   9.156
1ECLI  H814  103   7.721   8.509   9.305
1ECLI   C74  104   8.164   8.733  10.740
1ECLI  H74R  105   8.247   8.690  10.799

$ awk 'BEGIN{split("R,S,T",a,/,/)}f=$2~/^H[0-9]+$/{$2 = $2 a[++c]}!f{c=0}1' infile
1ECLI H813R 98 7.529 8.326 9.267
1ECLI H813S 99 7.427 8.470 9.251
1ECLI  C814  100   7.621   8.513   9.263
1ECLI H814R 101 7.607 8.617 9.289
1ECLI H814S 102 7.633 8.489 9.156
1ECLI H814T 103 7.721 8.509 9.305
1ECLI   C74  104   8.164   8.733  10.740
1ECLI  H74R  105   8.247   8.690  10.799

If you want more consistent formatting, with a tab or some other character as the field separator, you can use the one below and modify the OFS variable. The extra {$1=$1} forces every line, not just the modified ones, to be rebuilt with OFS.

$ awk -v OFS="\t" 'BEGIN{split("R,S,T",a,/,/)}f=$2~/^H[0-9]+$/{$2 = $2 a[++c]}!f{c=0}{$1=$1}1'  infile
1ECLI   H813R   98  7.529   8.326   9.267
1ECLI   H813S   99  7.427   8.470   9.251
1ECLI   C814    100 7.621   8.513   9.263
1ECLI   H814R   101 7.607   8.617   9.289
1ECLI   H814S   102 7.633   8.489   9.156
1ECLI   H814T   103 7.721   8.509   9.305
1ECLI   C74     104 8.164   8.733   10.740
1ECLI   H74R    105 8.247   8.690   10.799

The code below assumes that lines is a list of strings, each representing one line of your file.


from collections import defaultdict

with open('filename') as f:
    lines = f.readlines()

cntd = defaultdict(int)   # how many times each keyword has been seen so far in the file
suffix = ['R', 'S', 'T']
newlines = []
for line in lines:
    try:
        kwd = line.split()[1]   # second whitespace-separated field
    except IndexError:
        # blank or single-field line: copy it through unchanged
        newlines.append(line)
        continue
    if kwd[0] == 'H' and kwd[-1].isdigit():
        # note: raises IndexError if the same keyword occurs more than three times
        sfx = suffix[cntd[kwd]]
        idx = line.index(kwd)
        # drop one character before the keyword so the columns stay aligned
        nl = line[:idx - 1] + kwd + sfx + line[idx + len(kwd):]
        # nl = line[:idx + len(kwd)] + sfx + line[idx + len(kwd):] # adjust formatting to your taste
        newlines.append(nl)
        cntd[kwd] += 1
    else:
        newlines.append(line)

with open('filename', 'w') as f:
    f.writelines(newlines)
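The code above counts every occurrence of a keyword across the whole file, so a fourth occurrence of the same keyword would raise IndexError, and the count does not restart when a run is interrupted. If the requirement is strictly about consecutive lines, a variant along the following lines might be closer to what the question asks. It is a sketch under assumptions that are not in the original answer: the names PATTERN and suffix_consecutive are hypothetical, lines are assumed to start with 1ECLI followed by at least two spaces before the H field, and any match after the third in a run is left unchanged.

```python
import re

SUFFIXES = "RST"
# group 1: "1ECLI" plus all but one of the spaces before the field
# group 2: the H<digits> field, which must be followed by whitespace
PATTERN = re.compile(r"^(1ECLI\s+)\s(H[0-9]+)(?=\s)")


def suffix_consecutive(lines):
    """Append R, S, T to H<digits> on consecutive lines, keeping column widths."""
    out = []
    run = 0                               # length of the current run of matches
    for line in lines:
        m = PATTERN.match(line)
        if m and run < len(SUFFIXES):
            # one space is dropped and one letter added, so the width is unchanged
            out.append(m.group(1) + m.group(2) + SUFFIXES[run] + line[m.end():])
            run += 1
        else:
            if not m:
                run = 0                   # a non-matching line ends the run
            out.append(line)
    return out


sample = [
    "1ECLI  H813   98   7.529   8.326   9.267",
    "1ECLI  H813   99   7.427   8.470   9.251",
    "1ECLI  C814  100   7.621   8.513   9.263",
    "1ECLI  H814  101   7.607   8.617   9.289",
    "1ECLI  H814  102   7.633   8.489   9.156",
    "1ECLI  H814  103   7.721   8.509   9.305",
    "1ECLI   C74  104   8.164   8.733  10.740",
    "1ECLI  H74R  105   8.247   8.690  10.799",
]
for row in suffix_consecutive(sample):
    print(row)
```

Because the match is made on the raw line rather than on split fields, the rest of each line is copied through byte for byte, which matters if the file is a fixed-width format.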
