Binning the data

I have a dataset that looks like this:

-9030   KIR3DX1
-75     SLC12A6
8005    C14orf79
-251    ARAP1
65994   EFNB1
-12111  SLC7A5
-11643  CAMK2G
-19749  PRPS2
-23324  MIR198
10012   LOC100506172
-77     CCDC88A
12171   MMP14

Column 1 is the distance (in base pairs) of the element (gene) in column 2 from position 0, in either direction. I want to bin this data in windows of 50 base pairs.

Any suggestions? Thank you.

The program (hope Perl's OK):

#!/usr/bin/perl

# Create a histogram of some data

# Denis Howe 2012-07-03 - 2012-07-03 18:40

use strict;
use warnings;

# my $n_bins = 10;
my $width = 50;

# Read lines into @d
my @d = <DATA>; chomp @d;

# Split each line containing a digit into a pair
@d = map [split(/\s+/, $_)], grep /\d/, @d;

# Find range
my $min = 9E9; my $max = -9E9;
foreach (@d)
{
    $min = $_->[0] if ($_->[0] < $min);
    $max = $_->[0] if ($_->[0] > $max);
}

# Round down to a multiple of $width. int() truncates toward zero, which
# would round a negative minimum up, so subtract the non-negative remainder.
$min -= $min % $width;

# Ensure there's a bin for max value
# my $width = ($max*1.01 - $min) / $n_bins;
my $n_bins = int(($max - $min) / $width) + 1;

# Allocate data to bins
my @bin;
foreach (@d)
{
    push @{$bin[int(($_->[0]-$min)/$width)]}, $_;
}

# Show content of each bin
foreach (0 .. $n_bins-1)
{
    next unless ($bin[$_]);             # Ignore empty bins
    printf "%6d - %6d", $min + $_*$width, $min + ($_+1)*$width;
    print map("  " . $_->[0] . ":" . $_->[1], @{$bin[$_]}), "\n";
}

__DATA__
-9030   KIR3DX1
-75     SLC12A6
8005    C14orf79
-251    ARAP1
65994   EFNB1
-12111  SLC7A5
-11643  CAMK2G
-19749  PRPS2
-23324  MIR198
10012   LOC100506172
-77     CCDC88A
12171   MMP14
EOF

The output:

-23350 - -23300  -23324:MIR198
-19750 - -19700  -19749:PRPS2
-12150 - -12100  -12111:SLC7A5
-11650 - -11600  -11643:CAMK2G
 -9050 -  -9000  -9030:KIR3DX1
  -300 -   -250  -251:ARAP1
  -100 -    -50  -75:SLC12A6  -77:CCDC88A
  8000 -   8050  8005:C14orf79
 10000 -  10050  10012:LOC100506172
 12150 -  12200  12171:MMP14
 65950 -  66000  65994:EFNB1

HTH

Pretty simple one-liner:

perl -MPOSIX=floor -anE'push@{$f{floor($F[0]/50)}},$F[1]}{$,=" ";for(sort{$a<=>$b}keys%f){$i=$_*50;say"$i -",$i+49,": @{$f{$_}}"}'

Note that this one-liner produces the correct output for the test data (for example, -23324 MIR198 definitely belongs in -23350 - -23301):

-23350 - -23301 : MIR198
-19750 - -19701 : PRPS2
-12150 - -12101 : SLC7A5
-11650 - -11601 : CAMK2G
-9050 - -9001 : KIR3DX1
-300 - -251 : ARAP1
-100 - -51 : SLC12A6 CCDC88A
8000 - 8049 : C14orf79
10000 - 10049 : LOC100506172
12150 - 12199 : MMP14
65950 - 65999 : EFNB1
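
Whether a negative distance lands in the right bin comes down to how the division is rounded. Here is a minimal Python sketch of the difference between truncating toward zero and flooring, using a value from the sample data; the variable names are illustrative only:

```python
import math

WIDTH = 50          # bin width in base pairs, as in the question
distance = -23324   # MIR198 from the sample data

# Truncation rounds toward zero, so a negative value lands one bin too high.
truncated_start = int(distance / WIDTH) * WIDTH

# Flooring rounds toward negative infinity, which is what binning needs.
floored_start = math.floor(distance / WIDTH) * WIDTH

print(truncated_start, floored_start)
```

Only the floored start gives a 50-wide window that actually contains the value.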

s='''
-9030   KIR3DX1
-75     SLC12A6
8005    C14orf79
-251    ARAP1
65994   EFNB1
-12111  SLC7A5
-11643  CAMK2G
-19749  PRPS2
-23324  MIR198
10012   LOC100506172
-77     CCDC88A
12171   MMP14
'''

import re
from collections import defaultdict

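bins = defaultdict(list)
# Raw string so that \S and \s reach the regex engine unchanged
for distance, gene in re.findall(r'^(\S+)\s+(\S+)', s, re.M):
    # Floor division keeps negative distances in the correct bin
    bins[int(distance) // 50].append(gene)

print(bins)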

Does this work?

from collections import defaultdict

binner = defaultdict(list)
with open(datafilename) as f:
    for line in f:
        i = int(line.split()[0])
        binner[i // 50].append(line)

I'm just binning the entire line since I don't know what information you actually want to keep in there (it's a little unclear with a dataset that messy)...
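
If the file really is messy, one option is to skip lines that do not start with an integer rather than letting `int()` raise. A rough sketch, assuming whitespace-separated fields; `bin_lines` and the `sample` list are hypothetical, made up for illustration:

```python
from collections import defaultdict

def bin_lines(lines, width=50):
    """Group raw lines by floor(distance / width); skip unparsable lines."""
    binner = defaultdict(list)
    for line in lines:
        fields = line.split()
        if not fields:
            continue                      # blank line
        try:
            distance = int(fields[0])
        except ValueError:
            continue                      # first field is not an integer
        binner[distance // width].append(line.rstrip("\n"))
    return binner

# Hypothetical input: two valid neighbours, a blank line, and a junk line
sample = ["-75     SLC12A6\n", "\n", "junk line\n",
          "-77     CCDC88A\n", "8005    C14orf79\n"]
for key, members in sorted(bin_lines(sample).items()):
    print(key * 50, members)
```

Silently skipping bad lines is a design choice; logging or counting them may be preferable for real data.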

I will assume that you want gene names binned by distance in base pairs:

from collections import defaultdict, Counter

bins = defaultdict(Counter)
binsize = 50

with open(datafile) as inf:
    for line in inf:
        data = line.split('<', 1)[0]
        offset, name = data.split()
        bins[int(offset)//binsize][name] += 1

then

keys = sorted(bins)
for key in keys:
    values = ', '.join('{1} {0}'.format(a,b) for a,b in bins[key].most_common())
    print('{:>7} - {:>7} : {}'.format(binsize*key, binsize*(key+1)-1, values))

on your sample data results in

 -23350 -  -23301 : 1 MIR198
 -19750 -  -19701 : 1 PRPS2
 -12150 -  -12101 : 1 SLC7A5
 -11650 -  -11601 : 1 CAMK2G
  -9050 -   -9001 : 1 KIR3DX1
   -300 -    -251 : 1 ARAP1
   -100 -     -51 : 1 CCDC88A, 1 SLC12A6
   8000 -    8049 : 1 C14orf79
  10000 -   10049 : 1 LOC100506172
  12150 -   12199 : 1 MMP14
  65950 -   65999 : 1 EFNB1
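
The range labels in that output follow directly from the floor-division key. As a small illustration (`bin_range` is a hypothetical helper, not part of the answer), the inclusive bounds for any offset can be computed like this:

```python
def bin_range(offset, binsize=50):
    """Return the inclusive (start, end) of the bin containing offset."""
    key = offset // binsize          # floor division, also correct for negatives
    start = key * binsize
    return start, start + binsize - 1

for offset in (-23324, -75, -77, 8005):
    print(offset, bin_range(offset))
```

With the sample values, -75 and -77 share the bin -100 to -51, which matches the grouping shown above.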
