Match a set of variables in a txt file using Perl

I want to match a set of variables from an input file against my data file and return the corresponding fields.

In input.txt:

ENSG00000165322
ENSG00000170540
ENSG00000143153
ENSG00000213145

data.txt contains multiple fields that are (I think) separated by semicolons (;). Here is an example:

chr10   gencodeV7   gene    32094365    32217742    0.714042    -   .   gene_id "ENSG00000165322.12"; transcript_ids "ENST00000311380.4,ENST00000375250.5,ENST00000492028.1,ENST00000497085.1,ENST00000493008.1,ENST00000344936.2,ENST00000396144.4,ENST00000375245.4,ENST00000477117.1,ENST00000497103.1,ENST00000454919.1,"; RPKM1 "7.54177"; RPKM2 "9.47656"; iIDR "0.000";
chr16   gencodeV7   gene    18802991    18812917    7.333434    -   .   gene_id "ENSG00000170540.7"; transcript_ids "ENST00000304414.6,ENST00000545430.1,ENST00000546206.1,"; RPKM1 "84.0696"; RPKM2 "90.714"; iIDR "0.000";

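I want to match each variable in input.txt against the data file and print each ID together with its RPKM1 value (the associated value in double quotes) and its RPKM2 value, printing N/A where there is no match, so that the output looks like this: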

ENSG00000165322 7.54177 9.47656
ENSG00000170540 84.0696 90.714
ENSG00000143153 73.2162 85.090
ENSG00000213145 N/A N/A

I can do this with the following shell script (using grep, cut and sed):

exec < input.txt
while read line
do
    set $line
    rpkm=`grep $1 data.txt | cut -f9 | cut -d";" -f 3-4 | sed -e 's/;/\t/g'`
    echo $line $rpkm >> output.txt
done

But I am trying to learn Perl. After hours of searching I made the attempt below, but I don't know how to get the output.

  use strict; 
  use warnings;
    my $input_txt = "input.txt" ;
    my $raw_data = "data.txt" ;
    if ($input_txt =~ $raw_data) ;
close $input

Any suggestions and explanations would be great.

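My Perl skills are a bit rusty, but I put this together for you. I tested it with the data file snippet you provided in your question and it works (except that your sample data has no line for ENSG00000143153, so the output shows N/A for it).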

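I wasn't sure whether your gene_id should include or exclude what comes after the dot. In your example it appears to be excluded, so that is what I did. (There is a commented-out regex you can use in case I guessed wrong.)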
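To make the two choices concrete, here is a minimal standalone sketch of both captures applied to a single gene_id value. The variable names are illustrative only and the anchored patterns are simplified versions of the ones in the full script.

```perl
use strict;
use warnings;

# A gene_id value as it appears in data.txt (taken from the question).
my $field = 'ENSG00000165322.12';

# Version suffix excluded: capture only the part before the dot.
my ($without_version) = $field =~ /^(\w+)(?:\.\d+)?$/;

# Version suffix included: capture the ID together with ".12".
my ($with_version) = $field =~ /^(\w+\.\d+)$/;

print "$without_version\n";
print "$with_version\n";
```

Since the IDs in input.txt carry no version suffix, only the first form yields keys that can be looked up directly.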

I tried to add enough comments to the Perl code so that you can follow what I did along the way.

Hope this helps!

#!/usr/bin/perl
use strict;
use warnings;

my $input_file = 'input.txt';
my $data_file = 'data.txt';

# Read input file into array of variables
my @input_vars;
open my $input_file_handle, '<', $input_file or die $!;
while (<$input_file_handle>) {
  chomp $_;
  push @input_vars, $_;
}
close $input_file_handle;

# Read data file into array of data lines
my @data_lines;
open my $data_file_handle, '<', $data_file or die $!;
while (<$data_file_handle>) {
  chomp $_;
  push @data_lines, $_;
}
close $data_file_handle;

# Pare down data lines because we only care about gene_id, RPKM1, and RPKM2
# Create 2 associative arrays which store RPKM1 and RPKM2 values based on the gene_id as the key
my %rpkm1s;
my %rpkm2s;
foreach (@data_lines) {
  # If the gene id should exclude everything after the dot, as in your example.
  my $regex = 'gene_id(?:[ ]*)"(\w+)(?:\.\d+)?"(?:.*)RPKM1(?:[ ]*)"([0-9\.]+)"(?:.*)RPKM2(?:[ ]*)"([0-9\.]+)"';

  # If the gene id includes the dot and what's after it.
  # my $regex = 'gene_id(?:[ ]*)"(\w+\.\d+)"(?:.*)RPKM1(?:[ ]*)"([0-9\.]+)"(?:.*)RPKM2(?:[ ]*)"([0-9\.]+)"';

  while ($_ =~ m/$regex/g) {
    # $1 is gene_id, $2 is RPKM1, and $3 is RPKM2
    # Set RPKM1 value in array based on gene_id as the key
    $rpkm1s{$1} = $2;
    # Set RPKM2 value in array based on gene_id as the key
    $rpkm2s{$1} = $3;
  }
}

# Verify that I have gene_ids mapped to RPKM1 and RPKM2 values
#  while ((my $gene_id, my $rpkm1) = each(%rpkm1s)) {
#    print "GENE ID: $gene_id\n";
#    print "\tRPKM1: $rpkm1\n";
#    print "\tRPKM2: $rpkm2s{$gene_id}\n";
#    print "\n";
#  }

# Iterate through input variables, search for values in %rpkm1s and %rpkm2s
foreach (@input_vars) {
  print "$_ ";
  if (exists $rpkm1s{$_}) {
    print "$rpkm1s{$_} ";
  }
  else {
    print "N/A ";
  }

  if (exists $rpkm2s{$_}) {
    print "$rpkm2s{$_} ";
  }
  else {
    print "N/A ";
  }
  print "\n";
}
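
The script above loads both files into arrays and keeps two parallel hashes. A more compact variant, sketched here under some assumptions, reads the data line by line and stores both values in one hash keyed by gene_id. To keep the sketch self-contained it reads from in-memory strings instead of input.txt and data.txt, the transcript_ids lists are shortened, and `build_table` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Sample rows from the question (transcript_ids shortened for readability).
my $data = <<'END';
chr10 gencodeV7 gene 32094365 32217742 0.714042 - . gene_id "ENSG00000165322.12"; transcript_ids "ENST00000311380.4,"; RPKM1 "7.54177"; RPKM2 "9.47656"; iIDR "0.000";
chr16 gencodeV7 gene 18802991 18812917 7.333434 - . gene_id "ENSG00000170540.7"; transcript_ids "ENST00000304414.6,"; RPKM1 "84.0696"; RPKM2 "90.714"; iIDR "0.000";
END

my $input = "ENSG00000165322\nENSG00000170540\nENSG00000213145\n";

# Hypothetical helper: map gene_id (version suffix dropped) to [RPKM1, RPKM2].
sub build_table {
    my ($data_text) = @_;
    my %rpkm;
    # Open the string as an in-memory file; a real script would open data.txt.
    open my $fh, '<', \$data_text or die $!;
    while (my $line = <$fh>) {
        if ($line =~ /gene_id\s*"(\w+)(?:\.\d+)?".*?RPKM1\s*"([0-9.]+)".*?RPKM2\s*"([0-9.]+)"/) {
            $rpkm{$1} = [$2, $3];
        }
    }
    close $fh;
    return \%rpkm;
}

my $rpkm = build_table($data);

# Build one report line per requested ID, falling back to N/A.
my @report;
for my $id (split /\n/, $input) {
    my ($r1, $r2) = exists $rpkm->{$id} ? @{ $rpkm->{$id} } : ('N/A', 'N/A');
    push @report, "$id $r1 $r2";
}
print "$_\n" for @report;
```

Storing an array reference per gene keeps the two values together, so a single `exists` check decides between the real values and N/A.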

Here is a regular expression that matches your variables:

([a-z]{1}[A-Z]{3} "[0-9]\.[0-9]{3}")

I am not familiar with Perl, but this regex will return a set of variables that you can iterate over.
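
As a rough illustration of iterating over matches in Perl, the sketch below applies the pattern with the `/g` modifier in list context to one sample line from the question. Note that on this line the pattern as written matches only the `iIDR "0.000"` field (one lowercase letter, three uppercase letters, and a value with exactly three decimals). The second pattern, `(RPKM[12]) "([0-9.]+)"`, is an assumed alternative that targets the RPKM fields and is not part of the original answer.

```perl
use strict;
use warnings;

# Attribute column of one sample line from the question.
my $line = 'gene_id "ENSG00000170540.7"; transcript_ids "ENST00000304414.6,ENST00000545430.1,ENST00000546206.1,"; RPKM1 "84.0696"; RPKM2 "90.714"; iIDR "0.000";';

# The pattern from the answer; /g in list context returns every match.
my @as_written = $line =~ /([a-z]{1}[A-Z]{3} "[0-9]\.[0-9]{3}")/g;

# Assumed alternative: two capture groups per match give a flat
# (name, value, name, value, ...) list, which fills a hash directly.
my %rpkm = $line =~ /(RPKM[12]) "([0-9.]+)"/g;

print "as written: $_\n" for @as_written;
print "$_ => $rpkm{$_}\n" for sort keys %rpkm;
```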
