Binning the data
I have a dataset which looks like this:
-9030 KIR3DX1
-75 SLC12A6
8005 C14orf79
-251 ARAP1
65994 EFNB1
-12111 SLC7A5
-11643 CAMK2G
-19749 PRPS2
-23324 MIR198
10012 LOC100506172
-77 CCDC88A
12171 MMP14
Column 1 is the distance (in base pairs) of each element (gene) in column 2 from position 0, in either direction. I want to bin this data into windows of 50 base pairs.
Any suggestions? Thank you.
The program (hope Perl's OK):
#!/usr/bin/perl
# Create a histogram of some data
# Denis Howe 2012-07-03 - 2012-07-03 18:40
use strict;
use warnings;
use POSIX qw(floor);

# my $n_bins = 10;
my $width = 50;

# Read lines into @d
my @d = <DATA>;
chomp @d;

# Split each line containing a digit into a pair
@d = map [split(/\s+/, $_)], grep /\d/, @d;

# Find range
my $min = 9E9;
my $max = -9E9;
foreach (@d)
{
    $min = $_->[0] if ($_->[0] < $min);
    $max = $_->[0] if ($_->[0] > $max);
}

# Round down to a multiple of $width. Use floor() rather than int():
# int() truncates toward zero, which rounds a negative minimum up
$min = floor($min / $width) * $width;

# Ensure there's a bin for max value
# my $width = ($max*1.01 - $min) / $n_bins;
my $n_bins = int(($max - $min) / $width) + 1;

# Allocate data to bins
my @bin;
foreach (@d)
{
    push @{$bin[int(($_->[0] - $min) / $width)]}, $_;
}

# Show content of each bin
foreach (0 .. $n_bins-1)
{
    next unless ($bin[$_]); # Ignore empty bins
    printf "%6d - %6d", $min + $_*$width, $min + ($_+1)*$width;
    print map(" " . $_->[0] . ":" . $_->[1], @{$bin[$_]}), "\n";
}

__DATA__
-9030 KIR3DX1
-75 SLC12A6
8005 C14orf79
-251 ARAP1
65994 EFNB1
-12111 SLC7A5
-11643 CAMK2G
-19749 PRPS2
-23324 MIR198
10012 LOC100506172
-77 CCDC88A
12171 MMP14
The output:
-23350 - -23300 -23324:MIR198
-19750 - -19700 -19749:PRPS2
-12150 - -12100 -12111:SLC7A5
-11650 - -11600 -11643:CAMK2G
-9050 - -9000 -9030:KIR3DX1
-300 - -250 -251:ARAP1
-100 - -50 -75:SLC12A6 -77:CCDC88A
8000 - 8050 8005:C14orf79
10000 - 10050 10012:LOC100506172
12150 - 12200 12171:MMP14
65950 - 66000 65994:EFNB1
HTH
Pretty simple one-liner:
perl -MPOSIX=floor -anE'push@{$f{floor($F[0]/50)}},$F[1]}{$,=" ";for(sort{$a<=>$b}keys%f){$i=$_*50;say"$i -",$i+49,": @{$f{$_}}"}'
Note that this one-liner produces correct output for the test data (look at -23324 MIR198, which is definitely in -23350 - -23301, for example):
-23350 - -23301 : MIR198
-19750 - -19701 : PRPS2
-12150 - -12101 : SLC7A5
-11650 - -11601 : CAMK2G
-9050 - -9001 : KIR3DX1
-300 - -251 : ARAP1
-100 - -51 : SLC12A6 CCDC88A
8000 - 8049 : C14orf79
10000 - 10049 : LOC100506172
12150 - 12199 : MMP14
65950 - 65999 : EFNB1
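Why flooring matters for negative distances: truncation toward zero (Perl's `int`, Python's `int()`) would place -23324 in a bin starting at -23300, which does not contain it, whereas flooring gives -23350. A minimal Python sketch of the difference, assuming a 50 bp window; the helper names `bin_start_trunc` and `bin_start_floor` are made up for illustration and do not come from any answer here:

```python
import math

WIDTH = 50  # window size in base pairs (assumed from the question)

def bin_start_trunc(distance, width=WIDTH):
    """Lower bin edge computed by truncating toward zero."""
    return int(distance / width) * width

def bin_start_floor(distance, width=WIDTH):
    """Lower bin edge computed by rounding toward negative infinity."""
    return math.floor(distance / width) * width

# Compare both roundings on two negative and one positive distance
for d in (-23324, -75, 8005):
    print(d, bin_start_trunc(d), bin_start_floor(d))
```

The two helpers agree for positive distances and differ only for negative values that are not exact multiples of the width.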
s='''
-9030 KIR3DX1
-75 SLC12A6
8005 C14orf79
-251 ARAP1
65994 EFNB1
-12111 SLC7A5
-11643 CAMK2G
-19749 PRPS2
-23324 MIR198
10012 LOC100506172
-77 CCDC88A
12171 MMP14
'''
import re
from collections import defaultdict

# Map bin index (distance // 50) to the genes that fall in that bin
bins = defaultdict(list)
for distance, gene in re.findall(r'^(\S+)\s+(\S+)', s, re.M):
    bins[int(distance) // 50].append(gene)
print(bins)
Does this work?
from collections import defaultdict

binner = defaultdict(list)
with open(datafilename) as f:
    for line in f:
        i = int(line.split()[0])
        binner[i // 50].append(line)
I'm just binning the entire line since I don't know what information you actually want to keep in there (it's a little unclear with a dataset that messy)...
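If only the distance and the gene name are needed, the same idea can store parsed pairs instead of raw lines. This is a sketch under that assumption; the function name `bin_pairs` and the in-memory sample are illustrative and not part of the answer above:

```python
from collections import defaultdict
import io

def bin_pairs(lines, width=50):
    """Group (distance, gene) pairs by bin index distance // width."""
    bins = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:  # skip blank or malformed lines
            continue
        distance = int(fields[0])
        bins[distance // width].append((distance, fields[1]))
    return bins

# In-memory stand-in for the data file (three rows from the question)
sample = io.StringIO("-75 SLC12A6\n-77 CCDC88A\n8005 C14orf79\n")
for key, pairs in sorted(bin_pairs(sample).items()):
    print(key * 50, key * 50 + 49, pairs)
```

Any iterable of lines works as input, so an open file object can be passed in place of the `StringIO` sample.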
I will assume that you want gene names binned by distance in base pairs:
from collections import defaultdict, Counter

bins = defaultdict(Counter)
binsize = 50
with open(datafile) as inf:
    for line in inf:
        # Drop anything from the first '<' onward before parsing
        data = line.split('<', 1)[0]
        offset, name = data.split()
        bins[int(offset) // binsize][name] += 1
Then:
keys = sorted(bins)
for key in keys:
    values = ', '.join('{1} {0}'.format(a, b) for a, b in bins[key].most_common())
    print('{:>7} - {:>7} : {}'.format(binsize*key, binsize*(key+1)-1, values))
On your sample data, this results in:
-23350 - -23301 : 1 MIR198
-19750 - -19701 : 1 PRPS2
-12150 - -12101 : 1 SLC7A5
-11650 - -11601 : 1 CAMK2G
-9050 - -9001 : 1 KIR3DX1
-300 - -251 : 1 ARAP1
-100 - -51 : 1 CCDC88A, 1 SLC12A6
8000 - 8049 : 1 C14orf79
10000 - 10049 : 1 LOC100506172
12150 - 12199 : 1 MMP14
65950 - 65999 : 1 EFNB1