
Replace text in file by hash values with matching keys

I would like to replace every word in a file that matches a key of my hash with the corresponding value.

Hash:

$VAR1 = {
    'asmbl_1'  => 'TCONS_00000046',
    'asmbl_2'  => 'TCONS_00000014',
    'asmbl_16' => 'MELO3C000012',
}

File:

CM3.6.1_CONTIG30890 assembler   transcript  187 1568    .   -   .   gene_id "PASA_cluster_1"; transcript_id "align_id:184317|asmbl_1";
CM3.6.1_CONTIG30890 assembler   exon    187 251 .   -   .   gene_id "PASA_cluster_1"; transcript_id "align_id:184317|asmbl_1";
CM3.6.1_CONTIG30898 assembler   exon    1339    2793    .   -   .   gene_id "PASA_cluster_2"; transcript_id "align_id:184318|asmbl_2";

Desired output:

CM3.6.1_CONTIG30890 assembler   transcript  187 1568    .   -   .   gene_id "PASA_cluster_1"; transcript_id "align_id:184317|TCONS_00000046";
CM3.6.1_CONTIG30890 assembler   exon    187 251 .   -   .   gene_id "PASA_cluster_1"; transcript_id "align_id:184317|TCONS_00000046";
CM3.6.1_CONTIG30898 assembler   exon    1339    2793    .   -   .   gene_id "PASA_cluster_2"; transcript_id "align_id:184318|TCONS_00000014";

I'm looking for a straightforward way to do this, preferably in Perl, since the rest of my script is written in Perl.

Approaches:

  • Read the file line by line, extract the key from each line, look the key up in the hash and replace it with the value.
  • Read the hash pair by pair, open the file, read it line by line and replace matches.

(What is the difference between these two methods?)

  • Read the hash pair by pair and call bash "sed -i 's/key/value/'". A bit ugly; I would prefer to do everything in Perl.
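As a rough illustration of the first approach, here is a minimal sketch. It reads each line once and does a single hash lookup per ID, whereas the second approach would scan the whole file once for every hash pair. The helper name replace_in_line and the assumption that the ID always sits between "|" and the closing '";' are mine, not something given by the file format.

```perl
use strict;
use warnings;

my %replace = (
    asmbl_1  => 'TCONS_00000046',
    asmbl_2  => 'TCONS_00000014',
    asmbl_16 => 'MELO3C000012',
);

# Hypothetical helper: rewrite one line by extracting the ID after "|"
# and looking it up in the hash.
sub replace_in_line {
    my ( $line, $map ) = @_;

    # Assumption: the ID is the run of word characters between "|" and '";'.
    # IDs that are not hash keys are left unchanged.
    $line =~ s{\|(\w+)(?=";)}{ '|' . ( exists $map->{$1} ? $map->{$1} : $1 ) }ge;
    return $line;
}

# Two sample lines shortened from the file above; a real script would read the file.
my @lines = (
    'gene_id "PASA_cluster_1"; transcript_id "align_id:184317|asmbl_1";',
    'gene_id "PASA_cluster_2"; transcript_id "align_id:184318|asmbl_2";',
);
print replace_in_line( $_, \%replace ), "\n" for @lines;
```

Because the key is extracted first and then looked up, the cost per line does not grow with the number of hash entries.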

There's a nice trick I like: build a single regex from the hash keys, capture whichever key matches, and use the capture to look up the replacement:

use strict;
use warnings;

my %replace = (
    'asmbl_1'  => 'TCONS_00000046',
    'asmbl_2'  => 'TCONS_00000014',
    'asmbl_16' => 'MELO3C000012',
);

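# Build one alternation of all keys, longest first, with regex metacharacters escaped.
my $search = join( "|", map {quotemeta} sort { length($b) <=> length($a) } keys %replace );
# \b keeps a key from matching inside a longer word (e.g. asmbl_1 inside asmbl_16).
$search = qr/\b($search)\b/;

# Read STDIN or the files named on the command line and print each rewritten line.
while (<>) {
    s/$search/$replace{$1}/g;
    print;
}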

Something like this produces the desired output. (The diamond operator reads from STDIN, or from any files named on the command line, so the script can be invoked as myscript.pl <some_File_To_process>.)
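If the goal is to change the file itself, as sed -i would, the same loop can be combined with Perl's in-place editing variable $^I. This is only a sketch: the helper name replace_in_place and the ".bak" backup suffix are arbitrary choices of mine.

```perl
use strict;
use warnings;

my %replace = (
    asmbl_1  => 'TCONS_00000046',
    asmbl_2  => 'TCONS_00000014',
    asmbl_16 => 'MELO3C000012',
);

my $search = join '|', map { quotemeta } sort { length($b) <=> length($a) } keys %replace;
$search = qr/\b($search)\b/;

# Hypothetical helper: rewrite the named files in place, keeping a ".bak" copy.
sub replace_in_place {
    my @files = @_;
    local $^I   = '.bak';    # turn on in-place editing, like perl -i.bak
    local @ARGV = @files;    # the diamond operator reads the files listed in @ARGV
    while ( my $line = <> ) {
        $line =~ s/$search/$replace{$1}/g;
        print $line;         # while $^I is set, print goes to the file being rewritten
    }
    return;
}

# Only touch files when some are given on the command line.
replace_in_place(@ARGV) if @ARGV;
```

Invoked as myscript.pl some_file, this would leave the original content in some_file.bak and the rewritten content in some_file.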

This is all that is necessary:

use strict;
use warnings;

my %map = (
    asmbl_1  => 'TCONS_00000046',
    asmbl_2  => 'TCONS_00000014',
    asmbl_16 => 'MELO3C000012',
);

my $re = join '|', map quotemeta, keys %map;

while ( <DATA> ) {
    s/\b($re)\b/$map{$1}/g;
    print;
}

__DATA__
CM3.6.1_CONTIG30890 assembler   transcript  187 1568    .   -   .   gene_id "PASA_cluster_1"; transcript_id "align_id:184317|asmbl_1";
CM3.6.1_CONTIG30890 assembler   exon    187 251 .   -   .   gene_id "PASA_cluster_1"; transcript_id "align_id:184317|asmbl_1";
CM3.6.1_CONTIG30898 assembler   exon    1339    2793    .   -   .   gene_id "PASA_cluster_2"; transcript_id "align_id:184318|asmbl_2";

Output:

CM3.6.1_CONTIG30890 assembler   transcript  187 1568    .   -   .   gene_id "PASA_cluster_1"; transcript_id "align_id:184317|TCONS_00000046";
CM3.6.1_CONTIG30890 assembler   exon    187 251 .   -   .   gene_id "PASA_cluster_1"; transcript_id "align_id:184317|TCONS_00000046";
CM3.6.1_CONTIG30898 assembler   exon    1339    2793    .   -   .   gene_id "PASA_cluster_2"; transcript_id "align_id:184318|TCONS_00000014";
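One detail worth checking is that the hash contains both asmbl_1 and asmbl_16, so one key is a prefix of another. The sketch below suggests why the \b anchors matter in that case: without them, an alternation that tries asmbl_1 first rewrites the start of asmbl_16, while with them the regex engine backtracks to the longer key, so the order of the keys does not change the result. The function names and the sample string are made up for illustration.

```perl
use strict;
use warnings;

my %map = (
    asmbl_1  => 'TCONS_00000046',
    asmbl_16 => 'MELO3C000012',
);

# Deliberately put the shorter key first to force the worst-case ordering.
my $re = join '|', map { quotemeta } sort { length($a) <=> length($b) } keys %map;

# Substitution without word-boundary anchors.
sub replace_unanchored {
    my ($text) = @_;
    $text =~ s/($re)/$map{$1}/g;
    return $text;
}

# Substitution with \b on both sides of the alternation.
sub replace_anchored {
    my ($text) = @_;
    $text =~ s/\b($re)\b/$map{$1}/g;
    return $text;
}

my $sample = 'align_id:184400|asmbl_16';    # made-up ID, not from the file
print replace_unanchored($sample), "\n";    # align_id:184400|TCONS_000000466
print replace_anchored($sample),   "\n";    # align_id:184400|MELO3C000012
```

Sorting the keys longest first, as in the first answer, is an extra safeguard for keys that begin or end with non-word characters, where \b behaves differently.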
