In a file/array, search for each hash key and replace it with the hash value, for all hash keys/values
I've searched around the site and, surprisingly, I can't find anything that works for my particular problem, so I figured I'd post it and see how some of you more experienced programmers would address it.
I have a spreadsheet-like text file (many lines with tab-delimited columns) that I would like to search for certain labels (e.g. scaffold1253.1_size81005.6.32799_7496) and replace them with simplified labels (e.g. scaffold1253.1a). These labels appear only in the first column of the text file. I've already written the script so that I have a hash with the old labels as keys and the new labels as their respective values. This hash has about 26,000 entries. So essentially I'd like to take the hash keys one by one, search for them in the text file, and replace them with their respective hash values.
I have a pretty good server available, so if it's too complicated to make the search first-column specific to speed up the process, that's OK.
This is what I have so far:
use warnings;
$gtf = './Hc_genome/Hc_rztk_1+2+8+9.augustus.gtf';
open(FASTAFILE2, $gtf);
@gtfarray = <FASTAFILE2>;
#print @gtfarray;
my %hash;
while (<>)
{
    chomp;
    my ($key, $val) = split /\t/;
    $hash{$key} .= exists $hash{$key} ? ",$val" : $val;
}
#print %hash;
while (my ($find, $replace) = each %hash) {
    foreach (@gtfarray){
        $_ =~ s/$find/$replace/g;
        push @newgtf, $_;
    }
}
print @newgtf;
This code doesn't seem to work, as it never completes. I'm pretty sure it's a problem with the foreach loop structure. Sorry, I don't know of any other way to do this. Does anyone have a better way to run through this file and perform the replacement?
Any input would be greatly appreciated!
Thanks,
Andrew
@DVK
Here is the full script with your mods. It runs into syntax errors at your while loop; any idea why it's not accepted? Thanks again!
use warnings;
$gtf = './Hc_genome/Hc_rztk_1+2+8+9.augustus.gtf';
open(FASTAFILE2, $gtf);
my %hash;
while (<>){
    chomp;
    my ($key, $val) = split /\t/;
    $hash{$key} .= exists $hash{$key} ? ",$val" : $val;
}
while $line (<FASTAFILE2>){
    my @fields = split(/\t/, $line);
    # If you only care about first column, don't need the foreach loop below;
    # just do the loop insides on $fields[0]
    foreach my $field (@fields) {
        $field = $hash{$field} if exists $hash{$field};
        print $outfile "$field\t"; # Small bug - will print trailing \t
    }
    print $outfile "\n"
}
__END__
Here is the syntax error:
perl gtf_mod2.pl <./Hc_genome/header_file.txt
syntax error at gtf_mod2.pl line 14, near "while $line "
syntax error at gtf_mod2.pl line 23, near "}"
Execution of gtf_mod2.pl aborted due to compilation errors.
You exhaust your file the first time through your loop, using the initial $find and $replace key/value pair.
There are two potential solutions:
Example:
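REPLACE:
for my $line (@gtfarray) {
    keys %hash;    # reset the each() iterator, which 'next REPLACE' can leave mid-hash
    while (my ($find, $replace) = each %hash) {
        # \Q...\E quotes regex metacharacters such as '.' in the label
        if ($line =~ s/\Q$find\E/$replace/g) {
            push @newgtf, $line;
            next REPLACE;    # skip to the next line
        }
    }
    # if there was no replacement, push the unchanged line
    push @newgtf, $line;
}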
How big is the file in which you are replacing the first column?
If it's >50,000 lines, you are better off doing the reverse:
Iterate through the hash file once, and store that hash in memory.
Iterate through the main file once, and for every line, for every column, look that value up in the in-memory hash, replace it with the hash value if found, and write the line out.
In other words, remove the @gtfarray = <FASTAFILE2>; line and replace your last while loop with:
while (my $line = <FASTAFILE2>) {
    chomp $line;
    my @fields = split(/\t/, $line, -1);    # -1 keeps trailing empty columns
    # If you only care about the first column, you don't need the foreach loop below;
    # just do the lookup on $fields[0]
    foreach my $field (@fields) {
        $field = $hash{$field} if exists $hash{$field};
    }
    # $outfile must be a filehandle already opened for writing
    print $outfile join("\t", @fields), "\n";
}
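Since the question says the labels appear only in the first column, the loop body can be narrowed to that column. The following is a minimal sketch of that idea, not a drop-in replacement: the helper name relabel_first_column and the sample lines are made up for illustration, and it assumes the first column holds the complete label.

```perl
use strict;
use warnings;

# Replace only the first tab-separated column of a line when it is a key
# of the lookup hash; every other column is left untouched.
# (relabel_first_column is a hypothetical helper name.)
sub relabel_first_column {
    my ($line, $map) = @_;
    chomp $line;
    # A limit of 2 splits off just the first column and keeps the rest intact.
    my ($first, $rest) = split /\t/, $line, 2;
    $first = '' unless defined $first;    # empty input line
    $first = $map->{$first} if exists $map->{$first};
    return defined $rest ? "$first\t$rest" : $first;
}

# Made-up sample data, for illustration only.
my %hash = ('scaffold1253.1_size81005.6.32799_7496' => 'scaffold1253.1a');
my @sample = (
    "scaffold1253.1_size81005.6.32799_7496\tAUGUSTUS\tgene\n",
    "unknown_label\tAUGUSTUS\tgene\n",
);
print relabel_first_column($_, \%hash), "\n" for @sample;
```

Each line costs one hash lookup, so the running time grows with the number of lines in the file rather than with lines times hash keys.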
NOTE: I'm making an assumption that the fields contain the FULL contents of your hash keys (e.g. your data file would contain a field with "scaffold1253.1_size81005.6.32799_7496" but NOT a field with "XYZscaffold1253.1_size81005.6.32799_7496___IOU").
If that assumption is wrong and you really DO need to run a regex because your scaffold strings may be contained in longer strings, there may still be a better solution than running O(N*M) regexes: if your scaffold strings all follow a certain well-defined format (e.g. "scaffoldNNNNN.NNN_sizeNNNNN.NNN.NNNN_NNNN"), what you need to do then is:
For each line of the data file, run a single regex that finds that pattern, with the entire pattern inside capture-group parentheses:
my @matches = ($line =~ m/(scaffold\d+\.\d+_size\d+\.\d+\.\d+_\d+)/g);
Then, look up every value of the @matches array in the hash. If found, run ONLY those matches as s/// regexes.
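The match-then-look-up steps above can also be folded into a single substitution with the /e modifier. This is a sketch of that variation rather than the exact procedure described: the helper name relabel_embedded is made up, and the pattern assumes every long label has the same shape as the example label.

```perl
use strict;
use warnings;

# Pattern for the long labels. Its exact shape is an assumption based on
# the single example label in the question.
my $label_re = qr/scaffold\d+\.\d+_size\d+\.\d+\.\d+_\d+/;

# (relabel_embedded is a hypothetical helper name.)
sub relabel_embedded {
    my ($line, $map) = @_;
    # /e evaluates the replacement as code: use the hash value when the
    # matched label is a known key, otherwise put the match back unchanged.
    $line =~ s/($label_re)/exists $map->{$1} ? $map->{$1} : $1/ge;
    return $line;
}

# Made-up sample data, for illustration only.
my %hash = ('scaffold1253.1_size81005.6.32799_7496' => 'scaffold1253.1a');
print relabel_embedded("XYZscaffold1253.1_size81005.6.32799_7496___IOU\tgene\n", \%hash);
```

This runs one regex per line and one hash lookup per match, instead of one regex per hash key per line.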
Could it be a job for Tie::File? Assuming, that is, that the data file can be operated on as an array.
use Tie::File;
my $file = "./Hc_genome/Hc_rztk_1+2+8+9.augustus.gtf";
tie my @lines, 'Tie::File', $file or die "Cannot tie '$file': $!";
for (@lines) {
    s/Oldlabel/NewLabel/g;    # Change this to fit
}
untie @lines;
Tie::File does a bunch of tricks to keep the "in place" changes to the file memory efficient.
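To connect this with the hash from the question, the fixed s/Oldlabel/NewLabel/ could be swapped for a lookup on the first column. The following is only a sketch under that assumption; relabel_file_in_place is a made-up helper name, and because it rewrites the file in place it is safer to try it on a copy first.

```perl
use strict;
use warnings;
use Tie::File;

# Rewrite the first tab-separated column of every line of $file in place,
# using $map (old label => new label) as the lookup table.
# (relabel_file_in_place is a hypothetical helper name.)
sub relabel_file_in_place {
    my ($file, $map) = @_;
    tie my @lines, 'Tie::File', $file or die "Cannot tie '$file': $!";
    for my $line (@lines) {
        my ($first) = split /\t/, $line, 2;
        next unless defined $first && exists $map->{$first};
        # Assigning to the loop variable writes the record back to the file.
        $line = $map->{$first} . substr($line, length $first);
    }
    untie @lines;
    return;
}
```

A call would look like relabel_file_in_place($file, \%hash), where %hash is the old-label to new-label hash built earlier in the script.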
Looking at your previous post, wouldn't it be simpler to create the shortened 'id' while reading the file? Then you would have no need for the other file from which you build your hash.
Here is the (untested) code. (You would need to redirect the print output to a file on the command line, or open a file for writing in your script.)
#!/usr/bin/perl
use strict;
use warnings;
my $gtf = './Hc_genome/Hc_rztk_1+2+8+9.augustus.gtf';
open my $FASTAFILE2, "<", $gtf or die "Unable to open '$gtf' for reading. $!";
my %seen;
while (<$FASTAFILE2>) {
    chomp;
    my ($id, $val) = split /\t/, $_, 2;
    # copy $id to $prefix and
    # remove everything after '.1' in $prefix
    (my $prefix = $id) =~ s/\.1\K.*//;
    if ($seen{$id}) {
        ++$seen{$id};
    }
    else {
        $seen{$id} = 'a';
    }
    print "$prefix$seen{$id}\t$val\n";
}
close $FASTAFILE2 or die "Unable to close '$gtf' from reading. $!";