Match a set of variables in a txt file using Perl
I want to match a set of variables from an input file against my data file and return the corresponding fields.
input.txt:
ENSG00000165322
ENSG00000170540
ENSG00000143153
ENSG00000213145
data.txt contains multiple fields separated (I think) by semicolons (;). Here is an example:
chr10 gencodeV7 gene 32094365 32217742 0.714042 - . gene_id "ENSG00000165322.12"; transcript_ids "ENST00000311380.4,ENST00000375250.5,ENST00000492028.1,ENST00000497085.1,ENST00000493008.1,ENST00000344936.2,ENST00000396144.4,ENST00000375245.4,ENST00000477117.1,ENST00000497103.1,ENST00000454919.1,"; RPKM1 "7.54177"; RPKM2 "9.47656"; iIDR "0.000";
chr16 gencodeV7 gene 18802991 18812917 7.333434 - . gene_id "ENSG00000170540.7"; transcript_ids "ENST00000304414.6,ENST00000545430.1,ENST00000546206.1,"; RPKM1 "84.0696"; RPKM2 "90.714"; iIDR "0.000";
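I want to match each variable in input.txt against the data file and print the match together with the double-quoted values that follow RPKM1 and RPKM2, printing N/A where there is no match, so that the output looks like this: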
ENSG00000165322 7.54177 9.47656
ENSG00000170540 84.0696 90.714
ENSG00000143153 73.2162 85.090
ENSG00000213145 N/A N/A
I can do this in the shell with the following script (it uses grep, cut and sed):
exec < input.txt
while read line
do
set $line
rpkm=`grep $1 data.txt | cut -f9| cut -d";" -f 3-4 | sed -e 's/;/\t/g'`
echo $line $rpkm >> output.txt
done
But I am trying to learn Perl. After hours of searching I gave it a try, but I can't figure out how to get the output.
use strict;
use warnings;
my $input_txt = "input.txt" ;
my $raw_data = "data.txt" ;
if ($input_txt =~ $raw_data) ;
close $input
Any suggestions and explanations would be great.
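My Perl skills are a bit rusty, but I put this together for you. I tested it with the data file snippet you provided in your question and it works (except that your sample data has no line for ENSG00000143153, so the output shows N/A for it).
I wasn't sure whether your gene_id should include or exclude what comes after the dot. In your example it appears to be excluded, so that is what I did. (There is a commented-out regex you can use instead in case I guessed wrong.)
I tried to add enough comments to the Perl code so that you can follow what I did along the way.
Hope this helps!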
#!/usr/bin/perl
use strict;
use warnings;
my $input_file = 'input.txt';
my $data_file = 'data.txt';
# Read input file into array of variables
my @input_vars;
open my $input_file_handle, '<', $input_file or die $!;
while (<$input_file_handle>) {
    chomp $_;
    push @input_vars, $_;
}
close $input_file_handle;
# Read data file into array of data lines
my @data_lines;
open my $data_file_handle, '<', $data_file or die $!;
while (<$data_file_handle>) {
    chomp $_;
    push @data_lines, $_;
}
close $data_file_handle;
# Pare down data lines because we only care about gene_id, RPKM1, and RPKM2
# Create 2 associative arrays which store RPKM1 and RPKM2 values based on the gene_id as the key
my %rpkm1s;
my %rpkm2s;
foreach (@data_lines) {
    # If the gene id should exclude everything after the dot, as in your example.
    my $regex = 'gene_id(?:[ ]*)"(\w+)(?:\.\d+)?"(?:.*)RPKM1(?:[ ]*)"([0-9\.]+)"(?:.*)RPKM2(?:[ ]*)"([0-9\.]+)"';
    # If the gene id includes the dot and what's after it.
    # my $regex = 'gene_id(?:[ ]*)"(\w+\.\d+)"(?:.*)RPKM1(?:[ ]*)"([0-9\.]+)"(?:.*)RPKM2(?:[ ]*)"([0-9\.]+)"';
    while ($_ =~ m/$regex/g) {
        # $1 is gene_id, $2 is RPKM1, and $3 is RPKM2
        # Set RPKM1 value in array based on gene_id as the key
        $rpkm1s{$1} = $2;
        # Set RPKM2 value in array based on gene_id as the key
        $rpkm2s{$1} = $3;
    }
}
# Verify that I have gene_ids mapped to RPKM1 and RPKM2 values
# while ((my $gene_id, my $rpkm1) = each(%rpkm1s)) {
#     print "GENE ID: $gene_id\n";
#     print "\tRPKM1: $rpkm1\n";
#     print "\tRPKM2: $rpkm2s{$gene_id}\n";
#     print "\n";
# }
# Iterate through input variables, search for values in %rpkm1s and %rpkm2s
foreach (@input_vars) {
    print "$_ ";
    if (exists $rpkm1s{$_}) {
        print "$rpkm1s{$_} ";
    }
    else {
        print "N/A ";
    }
    if (exists $rpkm2s{$_}) {
        print "$rpkm2s{$_} ";
    }
    else {
        print "N/A ";
    }
    print "\n";
}
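For comparison, here is a more compact sketch of the same idea. It is not the script above, just one possible variant: it keeps both RPKM values in a single hash keyed by gene_id and reads the data file line by line instead of loading it into an array first. The sub names (parse_data_line, format_row) and the choice to pass the two file names on the command line are assumptions made for illustration, and the `//` operator needs Perl 5.10 or later.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Extract (gene_id, RPKM1, RPKM2) from one data line, or return an empty
# list if the line has no gene_id. The version suffix after the dot
# (e.g. ".12") is deliberately not captured.
sub parse_data_line {
    my ($line) = @_;
    my ($gene_id) = $line =~ /gene_id\s+"([^".]+)/ or return;
    my ($rpkm1)   = $line =~ /RPKM1\s+"([^"]*)"/;
    my ($rpkm2)   = $line =~ /RPKM2\s+"([^"]*)"/;
    return ($gene_id, $rpkm1, $rpkm2);
}

# Build one output row; missing values are shown as N/A.
sub format_row {
    my ($lookup, $id) = @_;
    my ($rpkm1, $rpkm2) = @{ $lookup->{$id} || [] };
    return join ' ', $id, $rpkm1 // 'N/A', $rpkm2 // 'N/A';
}

# Runs only when called as: perl script.pl input.txt data.txt
if (@ARGV == 2) {
    my ($input_file, $data_file) = @ARGV;

    # gene_id => [RPKM1, RPKM2]
    my %lookup;
    open my $data_fh, '<', $data_file or die "Cannot open $data_file: $!";
    while (my $line = <$data_fh>) {
        my ($gene_id, @values) = parse_data_line($line) or next;
        $lookup{$gene_id} = \@values;
    }
    close $data_fh;

    open my $input_fh, '<', $input_file or die "Cannot open $input_file: $!";
    while (my $id = <$input_fh>) {
        chomp $id;
        next unless length $id;    # skip blank lines
        print format_row(\%lookup, $id), "\n";
    }
    close $input_fh;
}
```

Because each RPKM field is matched with its own small pattern, the order of the fields on the line does not matter here, and a line that lacks one of them still yields N/A for just that column.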
Here is a regular expression that matches your variables:
([a-z]{1}[A-Z]{3} "[0-9]\.[0-9]{3}")
I'm not familiar with Perl, but this regex will return a set of variables that you can iterate over.
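To show what iterating over regex matches could look like in Perl, here is a minimal sketch. Note that it does not use the pattern above, which as written appears to match only fields shaped like iIDR "0.000"; instead it assumes a more general pattern for any `name "value";` pair. The sub name parse_attributes is hypothetical, and the sample line is the second data line from the question.

```perl
use strict;
use warnings;

# Collect every  name "value";  pair on a data line into a hash.
# With the /g flag in a while loop, each iteration picks up the next match.
sub parse_attributes {
    my ($line) = @_;
    my %attr;
    while ($line =~ /(\w+)\s+"([^"]*)"\s*;/g) {
        $attr{$1} = $2;
    }
    return %attr;
}

my $sample = 'chr16 gencodeV7 gene 18802991 18812917 7.333434 - . gene_id "ENSG00000170540.7"; transcript_ids "ENST00000304414.6,ENST00000545430.1,ENST00000546206.1,"; RPKM1 "84.0696"; RPKM2 "90.714"; iIDR "0.000";';

my %attr = parse_attributes($sample);
print "$_ = $attr{$_}\n" for qw(gene_id RPKM1 RPKM2 iIDR);
```

Here gene_id keeps its version suffix (ENSG00000170540.7), so it would need to be trimmed before comparing it with the IDs in input.txt.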