[英]Using pointers to lines in a file in perl
I'm trying to use some sort of pointers in perl so that I can look at two at two files that are sorted in alphabetical order and match things in both the files if they have the same name in the first column. 我试图在perl中使用某种指针,以便我可以查看两个按字母顺序排序的文件中的两个,如果两个文件在第一列中具有相同的名称,则它们将匹配这两个文件中的内容。 The way i'm searching through each file though is I'm looking at which lines first column is lower in alphabetical order and then moving the pointer on that file to the next line.
我搜索每个文件的方式是,我要查看第一行中哪些行的字母顺序较低,然后将文件上的指针移至下一行。 Somewhat similar to the pointers in merge sort.
有点类似于合并排序中的指针。 The code below is an example of what I want.
下面的代码是我想要的示例。
Using these two files. 使用这两个文件。
set1 设置1
apple 17 20
boombox 23 29
carl 25 29
cat 22 33
dog 27 44
set2 SET2
ants yes
boombox no
carl yes
dentist yes
dice no
dog no
I can make a script that does something like this 我可以做一个像这样的脚本
($name, $affirmation) = first line in set2; #part I'm confused about I just kind of need some sort of command of something that will do this
while (<>){
@set1 = split;
while ($name < set1[0]){
($name, $affirmation) = next line in set2; # part i'm confused about I just kind of need some sort of command of something that will do this
}
if ($name = $set[0]{
print @set1, $affirmation;
}
This is how I would run it 这就是我的运行方式
./script.txt set1
I would end up with 我最终会
boombox 23 29 no
carl 25 29 yes
dog 27 44 no
. 。
. 。
Edit: 编辑:
I tried some code in some of the answers to see if I could make some functional code out of it but I seem to be running into problems, and some of the syntax in the answers I could not understand so I'm having a lot of trouble figuring out how to debug or solve this. 我在一些答案中尝试了一些代码,以查看是否可以用它编写一些功能代码,但是我似乎遇到了问题,并且答案中的某些语法我无法理解,因此我遇到了很多弄清楚如何调试或解决此问题。
This is my specific example using the folllowing two text files 这是我使用以下两个文本文件的特定示例
text.txt 的text.txt
Apples 0 -1 -1 0 0 0 0 -1
Apricots 0 1 1 0 0 0 0 1
Fruit 0 -1 -1 0 0 0 0 -1
Grapes 0 -2 -1 0 0 0 0 -2
Oranges 0 1 1 0 0 0 0 -1
Peaches 0 -2 -1 0 0 0 0 -2
text2.txt text2.txt
Apples CHR1 + 1167628 1170420 1 1 N
Apricots CHR1 - 2115898 2144159 1 1 N
Oranges CHR1 - 19665266 19812066 1 1 N
Noidberry CHR1 - 1337728 1329993 1 1 N
Peaches CHR1 - 1337275 1342693 1 1 N
And this script script.pl 而这个脚本script.pl
#!/usr/bin/perl
use warnings;
my $file_1 = $ARGV[0];
my $file_2 = $ARGV[1];
open my $fh1, '<', $file_1 or die "Can't open $file_1: $!";
open my $fh2, '<', $file_2 or die "Can't open $file_2: $!";
open(my $single, '>', 'text.txt');
open(my $deep, '>', 'text2.txt');
OUTER: while (my $outer = <$fh1>){
chomp $outer;
@CopyNumber = split(' ', $outer);
($title, $title2) = split('\|', $CopyNumber[0]);
#print 'title: ',$title,' title2: ',$title2,"\n";
my $numLoss = 0;
my $deepLoss = 0;
for ($i = 1; $i <= $#CopyNumber; $i++){
#print "$CopyNumber[$i], $#CopyNumber, $i, \n";
if ($CopyNumber[$i] < 0){
$numLoss = $numLoss + 1;
if ($CopyNumber[$i] <-1){
$deepLoss = $deepLoss + 1;
}
}
}
if ($GeneSym and (($GeneSym cmp $title)==0)){ #or (($GeneSym cmp $title2)==0))){
print $single $Chrom,"\t",$Start,"\t",$Stop,"\t",$numLoss/$#CopyNumber,"\n";
print $deep $Chrom,"\t",$Start,"\t",$Stop,"\t",$deepLoss/$#CopyNumber,"\n";
next OUTER;
}
INNER: while (my $inner = <$fh2>){
($GeneSym, $Chrom, $Strand, $Start, $Stop, $MapId, $TotalMap, $AbnormalMerge, $Overlap) = split(' ', $inner);
$Chrom =~ s/CHR/hs/ee;
my $cmp = ($GeneSym cmp $title);
next OUTER if $cmp < 0;
if ($cmp==0){ #or (($GeneSym cmp $title2)==0)){
print $single $Chrom,"\t",$Start,"\t",$Stop,"\t",$numLoss/$#CopyNumber,"\n";
print $deep $Chrom,"\t",$Start,"\t",$Stop,"\t",$deepLoss/$#CopyNumber,"\n";
next OUTER;
}
}
}
If I run ./script.pl text.txt text2.txt I should get this printed into Number.txt 如果我运行./script.pl text.txt text2.txt,我应该将此打印到Number.txt中
//corresponding to columns 2,4,5 of text2.txt and the last column being the percentage of columns which have a number lower than 0 //对应于text2.txt的第2、4、5列,最后一列是数字小于0的列的百分比
hs1 1167628 1170420 0.375 //For Apples
hs1 2115898 2144159 0 //For Apricots
hs1 19665266 19812066 0.125 //For Oranges
hs1 1337275 1342693 0.375 //For Peaches
Instead I get this 相反,我得到这个
hs1 1167628 1170420 0.375
hs1 2115898 2144159 0
hs1 1337275 1342693 0.375
So I'm just getting an error where 所以我只是出现一个错误
hs1 19665266 19812066 0.125 //For Oranges
isn't printing 没有打印
Quite like you state, with: use cmp
for comparison, split
line into two terms. 就像您陈述的那样,使用:使用
cmp
进行比较, split
行分为两部分。
For each line of FILE1
file go through lines of FILE2
file, exiting when a match is found. 对于
FILE1
文件的每一行,都要经过FILE2
文件的各行,找到匹配项后退出。 Once the FILE2
overshoots alphabetically move to the next line of FILE1
. 一旦
FILE2
超调按字母顺序移动到FILE1
的下一行。
use warnings 'all';
use strict;
sub process {
my ($name, $affirm_1, $affirm_2) = @_;
print "$name $affirm_1 $affirm_2\n";
}
my $file_1 = 'set1.txt';
my $file_2 = 'set2.txt';
open my $fh1, '<', $file_1 or die "Can't open $file_1: $!";
open my $fh2, '<', $file_2 or die "Can't open $file_2: $!";
my ($name_2, $affirm_2);
FILE1: while (my $line1 = <$fh1>) {
chomp $line1;
my ($name_1, $affirm_1) = split ' ', $line1, 2;
if ($name_2) {
my $cmp = $name_1 cmp $name_2;
next FILE1 if $cmp < 0;
if ($cmp == 0) {
process($name_1, $affirm_1, $affirm_2);
next FILE1;
}
}
FILE2: while (my $line2 = <$fh2>) {
chomp $line2;
($name_2, $affirm_2) = split ' ', $line2, 2;
my $cmp = $name_1 cmp $name_2;
next FILE1 if $cmp < 0;
if ($cmp == 0) {
process($name_1, $affirm_1, $affirm_2);
next FILE1;
}
}
}
Comments on a few remaining details. 对其他一些细节发表评论。
Once a FILE2
line "overshoots," in the next iteration of FILE1
we need to first check that line, before entering the FILE2
loop to iterate over its remaining lines. 一旦
FILE2
行“超调”,在FILE1
的下一次迭代中,我们需要先检查该行,然后再进入FILE2
循环以遍历其其余行。 For the first FILE1
line the $name_2
is still undef thus if ($name_2)
. 对于第一行
FILE1
, $name_2
仍然是undef,因此if ($name_2)
。
Updated for edited post. 已更新为已编辑帖子。
use warnings 'all';
use strict;
sub process_line {
my ($single, $deep, $rline, $GeneSym, $Chrom, $Start, $Stop) = @_;
my ($numLoss, $deepLoss) = calc_loss($rline);
$Chrom =~ s/CHR/hs/;
print $single (join "\t", $Chrom, $Start, $Stop, $numLoss/$#$rline), "\n";
print $deep (join "\t", $Chrom, $Start, $Stop, $deepLoss/$#$rline), "\n";
}
sub calc_loss {
my ($rline) = @_;
my ($numLoss, $deepLoss) = (0, 0);
for my $i (1.. $#$rline) {
$numLoss += 1 if $rline->[$i] < 0;
$deepLoss += 1 if $rline->[$i] < -1;
}
return $numLoss, $deepLoss;
}
my ($Number, $NumberDeep) = ('Number.txt', 'NumberDeep.txt');
open my $single, '>', $Number or die "Can't open $Number: $!";
open my $deep, '>', $NumberDeep or die "Can't open $NumberDeep: $!";
my ($file_1, $file_2) = ('set1_new.txt', 'set2_new.txt');
open my $fh1, '<', $file_1 or die "Can't open $file_1: $!";
open my $fh2, '<', $file_2 or die "Can't open $file_2: $!";
my ($GeneSym, $Chrom, $Strand, $Start, $Stop,
$MapId, $TotalMap, $AbnormalMerge, $Overlap);
FILE1: while (my $line1 = <$fh1>) {
next if $line1 =~ /^\s*$/;
chomp $line1;
my @line = split ' ', $line1;
if ($GeneSym) {
my $cmp = $line[0] cmp $GeneSym;
next FILE1 if $cmp < 0;
if ($cmp == 0) {
process_line($single, $deep, \@line,
$GeneSym, $Chrom, $Start, $Stop);
next FILE1;
}
}
FILE2: while (<$fh2>) {
next if /^\s*$/;
chomp;
($GeneSym, $Chrom, $Strand, $Start, $Stop,
$MapId, $TotalMap, $AbnormalMerge, $Overlap) = split;
my $cmp = $line[0] cmp $GeneSym;
next FILE1 if $cmp < 0;
if ($cmp == 0) {
process_line($single, $deep, \@line,
$GeneSym, $Chrom, $Start, $Stop);
next FILE1;
}
}
}
This produces the desired output with given sample files. 这将在给定的示例文件中产生所需的输出。 Some shortcuts are taken, please let me know if comments would be helpful.
采取了一些捷径,请让我知道注释是否有帮助。 Here are a few
这里有一些
Much error checking should be added around. 应该增加很多错误检查。
I assume the first field of FILE1
to be used as it stands. 我假设要使用
FILE1
的第一个字段。 Otherwise changes are needed. 否则需要更改。
Processing is split into two functions, calculations being separate. 处理分为两个功能,计算是分开的。 This is not necessary.
这不是必需的。
$#$rline
is the index of the last element of $rline
arrayref. $#$rline
是$rline
arrayref的最后一个元素的索引。 If this is too much syntax to stomach use @$rline - 1
, for example as (0..@$rline-1)
如果语法太多,请使用
@$rline - 1
,例如(0..@$rline-1)
Some comments on the code posted in the question: 对问题中发布的代码的一些评论:
Always, always , please use warnings;
总是, 总是 ,请
use warnings;
(and use strict;
) (并
use strict;
)
loop over indices is best written foreach my $i (0..$#array)
最好在
foreach my $i (0..$#array)
写索引循环
The regex modifier /ee
is very involved. regex修饰符
/ee
非常有用。 There is absolutely no need for it here. 这里绝对没有必要。
You're right. 你是对的。 It's exactly like a merge sort, except only matching lines are output.
完全类似于合并排序,只输出匹配的行。
sub read_and_parse1 {
my ($fh) = @_;
defined( my $line = <$fh> )
or return undef;
my ($id, @copy) = split(' ', $line); # Use split(/\t/, $line) if tab-separated data
my ($gene_sym) = split(/\|/, $id);
return [ $gene_sym, @copy ];
}
sub read_and_parse2 {
my ($fh) = @_;
defined( my $line = <$fh> )
or return undef;
return [ split(' ', $line) ]; # Use split(/\t/, $line) if tab-separated data
}
my $fields1 = read_and_parse1($fh1);
my $fields2 = read_and_parse2($fh2);
while ($fields1 && $fields2) {
my $cmp = $fields1->[0] cmp $fields2->[0];
if ($cmp < 0) { $fields1 = read_and_parse1($fh1); }
elsif ($cmp > 0) { $fields2 = read_and_parse2($fh2); }
else {
my ($gene_sym, @copy) = @$fields1;
my (undef, $chrom, $strand, $start, $stop, $map_id, $total_map, $abnormal_merge, $overlap) = @$fields2;
$chrom =~ s/^CHR/hs/;
my $num_loss = grep { $_ < 0 } @copy;
my $deep_loss = grep { $_ < -1 } @copy;
print($single_fh join("\t", $chrom, $start, $stop, $num_loss/@copy ) . "\n");
print($deep_fh join("\t", $chrom, $start, $stop, $deep_loss/@copy ) . "\n");
$fields1 = read_and_parse1($fh1);
$fields2 = read_and_parse2($fh2);
}
}
Output: 输出:
$ cat single.txt
hs1 1167628 1170420 0.375
hs1 2115898 2144159 0
hs1 19665266 19812066 0.125
hs1 1337275 1342693 0.375
$ cat deep.txt
hs1 1167628 1170420 0
hs1 2115898 2144159 0
hs1 19665266 19812066 0
hs1 1337275 1342693 0.25
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.