In a file/array, search for hash key, and replace it with the hash value, do this for all hash keys/values

I've searched around the site and, surprisingly, I can't seem to find something that will work for my particular problem. So I figured I'd post it and see how some of you more experienced programmers would address it.

I have a spreadsheet-like text file (many lines with tab-delimited columns) that I would like to search for certain labels (e.g. scaffold1253.1_size81005.6.32799_7496) and replace them with simplified labels (e.g. scaffold1253.1a). These labels appear only in the first column of the text file. I've already written the script so that I have a hash with the old labels as keys and the corresponding new labels as their values. This hash has about 26,000 entries. So essentially I'd like to take the hash keys one by one, search for them in the text file, and replace them with their respective hash values.

I have a pretty good server available, so if it's too complicated to make the search first-column-specific to speed up the process, then that's OK.

This is what I have so far:

 use warnings;  



$gtf = './Hc_genome/Hc_rztk_1+2+8+9.augustus.gtf'; 
    open(FASTAFILE2, $gtf);
    @gtfarray = <FASTAFILE2>;
    #print @gtfarray;


my %hash;
while (<>)
{
   chomp;
   my ($key, $val) = split /\t/;
   $hash{$key} .= exists $hash{$key} ? ",$val" : $val;
}

#print %hash;

while (my ($find, $replace) = each %hash) {
    foreach (@gtfarray){
        $_ =~ s/$find/$replace/g;
        push @newgtf, $_;   
    }
}
print @newgtf;

This code doesn't seem to work, as it never completes. I'm pretty sure it's a problem with the foreach loop structure. Sorry, I don't know of any other way to do this. Does anyone have a better way to run through this file and do the replacement?

Any input would be greatly appreciated! Thanks,

Andrew

@DVK

Here is the full script with your mods. It runs into syntax errors at your while loop; any idea why it's not accepting it? Thanks again!

use warnings;  

$gtf = './Hc_genome/Hc_rztk_1+2+8+9.augustus.gtf'; 
    open(FASTAFILE2, $gtf);

my %hash;
while (<>){
    chomp;
    my ($key, $val) = split /\t/;
    $hash{$key} .= exists $hash{$key} ? ",$val" : $val;
}


while $line (<FASTAFILE2>){
    my @fields = split(/\t/, $line);
    # If you only care about first column, don't need the foreach loop below;
    #    just do the loop insides on $fields[0]
    foreach my $field (@fields) {
        $field = $hash{$field} if exists $hash{$field};
        print $outfile "$field\t"; # Small bug - will print training \t
    }
    print $outfile "\n"
}

__END__

Here is the syntax error:

perl gtf_mod2.pl <./Hc_genome/header_file.txt
syntax error at gtf_mod2.pl line 14, near "while $line "
syntax error at gtf_mod2.pl line 23, near "}"
Execution of gtf_mod2.pl aborted due to compilation errors.

You exhaust your file the first time through your loop using the initial $find and $replace key/value pair.

There are two potential solutions:

  1. Open the file for reading during each iteration of your while loop (expensive)
  2. Move the foreach loop to the outside of the while and iterate the hash each time (less expensive)

Example:

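REPLACE:
for my $line (@gtfarray) {
   keys %hash; # reset the each() iterator, which "next REPLACE" can leave mid-hash
   while (my ($find, $replace) = each %hash) {
      # \Q...\E quotes regex metacharacters such as "." in the label
      if ($line =~ s/\Q$find\E/$replace/g) {
         push @newgtf, $line;
         next REPLACE; # skip to the next line
      }
   }
   # if there was no replacement, push the old line
   push @newgtf, $line;
}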

How big is the file that you are replacing the first column in?

If it's >50,000 lines, you are better off doing the reverse:

  • Iterate through the hash file once, and store that hash in memory

  • Iterate through the main file once, and for every line, for every column, look that value up in the in-memory hash, replace it with the hash value if found, and write.

In other words, remove the first @gtfarray = <FASTAFILE2>; and replace your last while loop with:

while (my $line = <FASTAFILE2>) {
    chomp $line;
    # limit -1 keeps trailing empty columns
    my @fields = split(/\t/, $line, -1);
    # If you only care about the first column, you don't need the foreach loop below;
    #    just do the lookup on $fields[0]
    foreach my $field (@fields) {
        $field = $hash{$field} if exists $hash{$field};
    }
    # join avoids a trailing \t; $outfile is assumed to be a filehandle opened for writing
    print $outfile join("\t", @fields), "\n";
}
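Since the question says the labels only appear in the first column, here is a minimal sketch of that first-column-only variant. The helper name relabel_first_column and the second sample line are made up for illustration, and the sample data stands in for the real header and GTF files.

```perl
use strict;
use warnings;

# Hypothetical helper: replace only the first tab-delimited column of a line
# when it exactly matches a key in the lookup hash.
sub relabel_first_column {
    my ($line, $map) = @_;
    chomp $line;
    return $line unless length $line;    # leave empty lines alone
    # limit 2: split off the first column only and keep the rest untouched
    my ($first, $rest) = split /\t/, $line, 2;
    $first = $map->{$first} if exists $map->{$first};
    return defined $rest ? "$first\t$rest" : $first;
}

# Made-up sample data standing in for the header file and the GTF file.
my %hash = ('scaffold1253.1_size81005.6.32799_7496' => 'scaffold1253.1a');
my @gtf_lines = (
    "scaffold1253.1_size81005.6.32799_7496\tAUGUSTUS\tgene\n",
    "scaffold9.1_size10.1.1_1\tAUGUSTUS\tgene\n",
);

print relabel_first_column($_, \%hash), "\n" for @gtf_lines;
```

Each line costs one hash lookup, so the work grows with the number of lines rather than with lines times hash keys.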

NOTE: I'm making an assumption that the fields contain the FULL contents of your hash keys (e.g. your data file would contain a field with "scaffold1253.1_size81005.6.32799_7496" but NOT a field with "XYZscaffold1253.1_size81005.6.32799_7496___IOU").

If that assumption is wrong and you really DO need to run a regex because your scaffold strings may be contained in longer strings, there may still be a better solution than running O(N*M) regexes: if your scaffold strings all follow a certain well-defined format (e.g. "scaffoldNNNNN.NNN_sizeNNNNN.NNN.NNNN_NNNN"), what you need to do then is:

  • For each line of the data file, run a single regex finding that pattern, with the entire pattern inside capture group parentheses:

     @matches = ($line =~ m/(scaffold\d+\.\d+_size\d+\.\d+\.\d+_\d+)/g );
  • Then, look up every value of the @matches array in the hash. If found, run ONLY those matches through an s/// regex.
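The two steps above can also be folded into one substitution per line. This is a sketch of that variant rather than the exact two-step procedure described: it assumes the labels really do follow the "scaffoldN.N_sizeN.N.N_N" shape, and the helper name relabel_embedded is made up for illustration.

```perl
use strict;
use warnings;

# Pattern for labels shaped like "scaffoldN.N_sizeN.N.N_N" (assumed format)
my $label_re = qr/scaffold\d+\.\d+_size\d+\.\d+\.\d+_\d+/;

# Hypothetical helper: replace every embedded label that is a known hash key.
sub relabel_embedded {
    my ($line, $map) = @_;
    # /e evaluates the replacement as code: swap in the hash value when the
    # matched label is a known key, otherwise put the match back unchanged.
    $line =~ s/($label_re)/exists $map->{$1} ? $map->{$1} : $1/ge;
    return $line;
}

my %hash = ('scaffold1253.1_size81005.6.32799_7496' => 'scaffold1253.1a');
print relabel_embedded("XYZscaffold1253.1_size81005.6.32799_7496___IOU\tgene\n", \%hash);
```

The regex runs once per line regardless of how many keys the hash holds, and the lookup itself is a plain hash access.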

Could it be a job for Tie::File? Assuming, that is, the data file could be operated on as an array.

use Tie::File;

my $file = "./Hc_genome/Hc_rztk_1+2+8+9.augustus.gtf";

tie my @lines, 'Tie::File', $file or die "Cannot tie '$file': $!";
for (@lines) {
    s/Oldlabel/NewLabel/g;   # Change this to fit
}

untie @lines;

Tie::File does a bunch of tricks to keep the "in place" changes to the file memory-efficient.
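To combine this with the lookup hash from the question, something like the following might work. It is only a sketch: the helper name relabel_file_in_place is made up, it assumes exact matches on the first tab-delimited column, and it takes the file name from the command line.

```perl
use strict;
use warnings;
use Tie::File;

# Hypothetical helper: rewrite the first column of $file in place via Tie::File.
sub relabel_file_in_place {
    my ($file, $map) = @_;
    tie my @lines, 'Tie::File', $file or die "Cannot tie '$file': $!";
    for my $line (@lines) {
        my ($first, $rest) = split /\t/, $line, 2;
        next unless defined $first && exists $map->{$first};
        # Assigning to the aliased element writes the change back to the file
        $line = defined $rest ? "$map->{$first}\t$rest" : $map->{$first};
    }
    untie @lines;
    return;
}

# Made-up sample mapping; only touches a file if one is named on the command line.
my %hash = ('scaffold1253.1_size81005.6.32799_7496' => 'scaffold1253.1a');
relabel_file_in_place($ARGV[0], \%hash) if @ARGV;
```

Because this modifies the file itself, it is worth trying on a copy first.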

Looking at your previous post, wouldn't it be simpler to create the shortened 'id' while reading the file? Then you would have no need for the other file that you build your hash from.

Here is the (untested) code. (You would need to redirect the print output to a file on the command line, or open a file for writing in your script.)

#!/usr/bin/perl
use strict;
use warnings;

my $gtf = './Hc_genome/Hc_rztk_1+2+8+9.augustus.gtf';
open my $FASTAFILE2, "<", $gtf or die "Unable to open '$gtf' for reading. $!";

my %seen;

while (<$FASTAFILE2>) {
    chomp;
    my ($id, $val) = split /\t/, $_, 2;

    # copy $id to $prefix and
    # remove everything after '.1' in $prefix
    (my $prefix = $id) =~ s/\.1\K.*//; 

    if ($seen{$id}) {
        ++$seen{$id};
    }
    else {
        $seen{$id} = 'a';   
    }
    print "$prefix$seen{$id}\t$val\n";
}

close $FASTAFILE2 or die "Unable to close '$gtf' from reading. $!";
