Find median of value

I have a data set of gene nodes. Each row contains a pair of nodes and a weight for that pair. I need to find the median weight for each gene pair: I count how many times each node pair occurs in the whole data set and then compute the median of its weights. Here Col[0] and Col[1] are the node pair and Col[2] is the weight. The code below prints the nodes and the correct median for pairs that occur an odd number of times, but for pairs that occur an even number of times it prints the larger of the two middle values. Any suggestions appreciated.

Input: a small sample from a large file.

5372 937 65.0
4821 937 65.0
4376 937 65.0
2684 937 65.0
4391 3715 1880.0
3436 1174 2383.0
3436 3031 2383.0
3436 1349 2383.0
5372 937 70.0
4821 937 70.0
4376 937 70.0
2684 937 70.0
3826 896 10.0
3826 896 17.0
5372 937 62.0
4821 937 62.0
4376 937 62.0
2684 937 62.0
3826 896 50.0
4944 3715 482.0
4944 4391 482.0
2539 1431 323.0
5372 937 59.0
4821 937 59.0
4376 937 59.0
2684 937 59.0
896 606 11.0
3826 896 10.0
5045 4901 11.0
4921 4901 11.0
4901 3545 11.0
4901 3140 11.0
4901 4243 11.0

Code:

from collections import defaultdict
import numpy as np

pt  = defaultdict(float)
pm  = defaultdict(float)
pc  = defaultdict(int)
with open('input.txt', 'r') as f:
    with open('output.txt', 'w') as o:
        for numline, line in enumerate((line.split() for line in f), start=1):
            pair = line[0], line[1]
            pc[pair] += 1
            pt[pair] = float(line[2])
            pm[pair] = np.median(pt[pair])
            print(pair, pc[pair], pm[pair])

By definition, the median of an even number of values is the average of the two middle values, and the median of an odd number of values is the middle value. How can I get the correct median when a pair occurs an even number of times?
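To make that definition concrete, here is a minimal sketch in plain Python. The function name `median_by_definition` is made up for illustration, and the weights are the ones listed for the pair 3826 896 in the sample input (the three-value list simply drops the last occurrence).

```python
def median_by_definition(values):
    """Return the median of a non-empty sequence of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # odd count: the single middle value
        return ordered[mid]
    # even count: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_weights = [10.0, 17.0, 50.0]
even_weights = [10.0, 17.0, 50.0, 10.0]

print(median_by_definition(odd_weights))   # 17.0
print(median_by_definition(even_weights))  # 13.5
```

The even case sorts to 10.0, 10.0, 17.0, 50.0, so the two middle values are 10.0 and 17.0 and their average is 13.5.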

Your pt dictionary is not right. You are storing only the last weight of each pair, but to compute the median you need the whole list of weights. You could do:

from collections import defaultdict
import numpy as np

pt  = defaultdict(list)
pc  = defaultdict(int)
with open('input.txt', 'r') as f:
    with open('output.txt', 'w') as o:
        for numline, line in enumerate((line.split() for line in f), start=1):
            pair = line[0], line[1]
            pc[pair] += 1
            # keep every weight seen for this pair, not just the last one
            pt[pair].append(float(line[2]))

# now with the medians
pm  = dict()
for pair, weights in pt.items():
    pm[pair] = np.median(weights)
    print(pair, pc[pair], pm[pair])
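For a quick check without any input file, the same grouping idea can be sketched with only the standard library. This is an illustration under assumptions, not a drop-in replacement: `medians_by_pair` and `sample` are hypothetical names, the sample holds only a few lines copied from the question's input, and `statistics.median` stands in for `np.median`.

```python
from collections import defaultdict
from statistics import median

# A few lines copied from the sample input in the question.
sample = """\
5372 937 65.0
3826 896 10.0
3826 896 17.0
5372 937 70.0
3826 896 50.0
5372 937 62.0
5372 937 59.0
3826 896 10.0
4391 3715 1880.0
"""


def medians_by_pair(lines):
    """Group weights by node pair and return {pair: (count, median)}."""
    weights = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        weights[(fields[0], fields[1])].append(float(fields[2]))
    # the count is just the length of each list, so no separate counter is needed
    return {pair: (len(w), median(w)) for pair, w in weights.items()}


result = medians_by_pair(sample.splitlines())
for pair, (count, med) in result.items():
    print(pair, count, med)
```

With these lines, the pair 5372 937 occurs four times with weights 65.0, 70.0, 62.0, 59.0, so its median is the average of 62.0 and 65.0, that is 63.5.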
