Threads for DNA sequence analysis using Perl

Question

I have a sample DNA sequence, for example S = ATGCGGGCGTGCTGCTGGGCTGCT...., which is 5 MB long. I also have the coordinates of each gene, for example:

Gene no. Start End
1          1    50
2         60    100
3         110   250
.....
4000      4640942 4641628

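My goal is to perform a certain calculation at the start position of each gene. My code works correctly, but it is slow. I have gone through many help pages on using threads to make it faster, but unfortunately I could not figure it out.

Here is a summary of my code: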

foreach my $gene (@genes) {    # @genes: one tab-separated "number, start, end" line per gene
     my @coordinates = split("\t", $gene);
     my $model1 = substr($sequence, $coordinates[1], 50);
     my $model2 = substr($sequence, $coordinates[1], 60);
     my $c_value = calculate($model1, $model2);
     ....
}

sub calculate {
     ......
}
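
To make the loop above concrete, here is a minimal self-contained sketch. The toy sequence, the three-gene list, and the body of calculate (a simple G+C count difference) are placeholders invented for illustration, not the real data or calculation. It also assumes the coordinates in the table are 1-based, so it subtracts 1 before calling substr, which uses 0-based offsets.

```perl
use strict;
use warnings;

# Toy stand-ins for the real data (assumptions for illustration only).
my $sequence = 'ATGCGGGCGTGCTGCTGGGCTGCT' x 20;    # 480 bases
my @genes    = ("1\t1\t50", "2\t60\t100", "3\t110\t250");

# Hypothetical placeholder for the real calculate():
# difference in G+C counts between the two windows.
sub calculate {
    my ($model1, $model2) = @_;
    my $gc1 = ($model1 =~ tr/GC//);    # tr with an empty replacement only counts
    my $gc2 = ($model2 =~ tr/GC//);
    return $gc2 - $gc1;
}

my @c_values;
foreach my $gene (@genes) {
    my ($number, $start, $end) = split /\t/, $gene;
    my $offset = $start - 1;           # assumed 1-based start -> 0-based offset
    my $model1 = substr($sequence, $offset, 50);
    my $model2 = substr($sequence, $offset, 60);
    push @c_values, calculate($model1, $model2);
}

print "gene ", $_ + 1, ": $c_values[$_]\n" for 0 .. $#c_values;
```

Each iteration depends only on the sequence and one line of coordinates, which is what makes the loop a candidate for running in parallel.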

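I would appreciate it if someone could suggest how to parallelize this kind of program. What I want to run in parallel is the calculation of the c-value between model 1 and model 2 for each gene, which should speed up the whole process. I tried using Thread::Queue but ended up with a bunch of errors. I am fairly new to Perl programming, so any help is highly appreciated.

Thank you all for your comments and suggestions. I have modified the code, and it seems to work using the Perl module Parallel::ForkManager. The code successfully uses all 4 cores of my computer.

Here is the modified code: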

    use strict;
    use warnings;
    use Data::Dumper;
    use Parallel::ForkManager;
    my $threads = 4;
    my $pm = Parallel::ForkManager->new($threads);
    my $i = 1; # gene number counter
    $pm->run_on_finish( sub { $i++; print STDERR "Checked $i genes" if ($i % $number_of_genes == 0); } );
    my @store_c_value = ();
    foreach my $gene (@genes) {    # @genes: one tab-separated "number, start, end" line per gene
        my $pid = $pm->start and next;
        my @coordinates = split("\t", $gene);
        my $model1 = substr($sequence, $coordinates[1], 50);
        my $model2 = substr($sequence, $coordinates[1], 60);
        my $c_value = calculate($model1, $model2);
        push(@store_c_value, $c_value);    # runs in the child process only
        $i++;
        $pm->finish;
    }
    $pm->wait_all_children;
    sub calculate {
        ......
        return ($c_value);
    }
    print Dumper \@store_c_value;

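The current problem is that I do not get any output in @store_c_value (i.e. the array is empty). I found out that you cannot store data from a child process into an array declared in the main program. I know I could print it to an external file, but I want this data in the @store_c_value array because I will use it again later in the program.

Thank you again for helping me.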

Answer 1

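One option is IO::Async::Function, which will use either forks or threads depending on your operating system (forking is much more efficient on Unixy systems) and maintains a pool of worker processes/threads to run the code in parallel. It returns Future instances, which can be used to synchronize the asynchronous code as needed. There are many ways to use a Future; a couple are shown below.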

use strict;
use warnings;
use IO::Async::Loop;
use IO::Async::Function;
use Future;

my $loop = IO::Async::Loop->new;
# additional options can be passed to the IO::Async::Function constructor to control how the workers are managed
my $function = IO::Async::Function->new(code => \&calculate);
$loop->add($function);

my @futures;
foreach my $gene (@genes) {    # @genes: one tab-separated "number, start, end" line per gene
     my @coordinates = split("\t",$gene);
     my $model1 = substr($sequence, $coordinates[1], 50);
     my $model2 = substr($sequence, $coordinates[1], 60);
     push @futures, $function->call(args => [$model1, $model2])->on_done(sub {
         my $c_value = shift;
         # further code using $c_value must be here, to be run once the calculation is done
     })->on_fail(sub {
         warn "Error in calculation for $gene: $_[0]\n";
     });
}

# wait for all calculations and on_done handlers before continuing
Future->wait_all(@futures)->await;
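
Once wait_all has completed, the results can also be read back from the individual futures rather than only inside the on_done handlers. The sketch below uses immediate futures (Future->done and Future->fail) with made-up values as stand-ins for what $function->call would return, so it runs without an event loop; the numbers and the failure message are purely illustrative.

```perl
use strict;
use warnings;
use Future;

# Immediate futures standing in for the results of $function->call(...).
my @futures = (
    Future->done(0.42),
    Future->fail("calculation failed\n"),
    Future->done(0.17),
);

# wait_all becomes ready once every component is ready,
# whether each one succeeded or failed.
Future->wait_all(@futures)->await;

my @c_values;
foreach my $idx (0 .. $#futures) {
    my $f = $futures[$idx];
    if ($f->is_done) {
        $c_values[$idx] = $f->get;     # scalar context: first result value
    }
    else {
        warn "Calculation $idx failed: " . $f->failure;
        $c_values[$idx] = undef;       # keep positions aligned with the genes
    }
}
```

Keeping one slot per future means the c-values stay in the same order as the genes, even when some calculations fail.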

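If you want the program to stop immediately when one of the calculations throws an exception, you can use needs_all and remove the individual on_fail handlers. You can then call get, which is a wrapper around await that returns all the c-values in order if every calculation succeeded, or throws the exception if one failed.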

my @c_values = Future->needs_all(@futures)->get;
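
If you would rather stay with the Parallel::ForkManager code from the question, the module documents a way to send data back from a child: pass a reference as the second argument to finish, and receive it as the last argument of the run_on_finish callback in the parent. The following is a minimal sketch of that mechanism, not a tuned solution. The toy sequence, the gene list, the calculate body, and the 1-based-coordinate adjustment are all assumptions made for illustration.

```perl
use strict;
use warnings;
use Parallel::ForkManager;

# Toy stand-ins for the real data (assumptions for illustration only).
my $sequence = 'ATGCGGGCGTGCTGCTGGGCTGCT' x 20;
my @genes    = ("1\t1\t50", "2\t60\t100", "3\t110\t250");

# Hypothetical placeholder for the real calculate().
sub calculate {
    my ($model1, $model2) = @_;
    return ($model2 =~ tr/GC//) - ($model1 =~ tr/GC//);
}

my $pm = Parallel::ForkManager->new(4);
my @store_c_value;

# Runs in the parent each time a child exits. The last argument is the
# reference that the child passed to finish().
$pm->run_on_finish(sub {
    my ($pid, $exit_code, $ident, $exit_signal, $core_dump, $data) = @_;
    $store_c_value[$ident] = $$data if defined $data;
});

foreach my $idx (0 .. $#genes) {
    $pm->start($idx) and next;         # parent continues; $idx becomes $ident
    my (undef, $start) = split /\t/, $genes[$idx];
    my $model1 = substr($sequence, $start - 1, 50);   # assumed 1-based start
    my $model2 = substr($sequence, $start - 1, 60);
    my $c_value = calculate($model1, $model2);
    $pm->finish(0, \$c_value);         # child exits and hands the value back
}
$pm->wait_all_children;

print "gene ", $_ + 1, ": $store_c_value[$_]\n" for 0 .. $#store_c_value;
```

Using the gene index as the process identifier keeps the results in gene order regardless of which child finishes first. With thousands of genes, forking one child per gene may cost more than the calculation itself, so handing each child a batch of genes is probably worth trying.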
