Simple way to find median

I have a data file and I perform a few operations on the data. I can get solutions for all the other operations just fine. The only thing I am not able to calculate is the median.

Input: a few lines from a huge input file.

00904bcabb02 00904bf7d758 676.0
0030657cc312 00904b1f1154 120.0
00306597852d 00904b48a3b6 572.0
00904b1f1154 00904bcabb02 120.0
00904b1f1154 00904bf7d758 120.0
00904b48a3b6 00904ba7a3eb 572.0
00022d1aa531 0006254f5810 2.0
00022dac729c 0006254f5810 2.0
00022dbd5c9e 0006254f5810 2.0
0006254f5810 0050dad80267 2.0
0006254f5810 00904be2b271 2.0
00022d097904 004096f41eb8 20.0
00022d2d30dd 004096f41eb8 20.0
004096f41eb8 00904b1e7852 20.0
00022d1406df 00022d36a6df 8.0
00022d36a6df 00022d8cb682 8.0
00022d36a6df 0030654a05fa 8.0
0004230dd7de 000423cbac29 33.0
0004231e4f43 000423cbac29 33.0
0030659b49f1 00904b310619 29.0

For every pair of col[0] col[1] I find the frequency, and the sum and average of the corresponding values. I am also trying to find the median of the values collected in pairtime for each pair. I am using numpy.median, but that does not seem to be working. Any suggestions are appreciated. Thanks.

Code:

from collections import defaultdict
import numpy as np
paircount = defaultdict(int)
pairtime = defaultdict(float)
pairper = defaultdict(float)
timeavg = defaultdict(float)
timefreq = defaultdict(int)

#get number of pair occurrences and total time
with open('Input.txt', 'r') as f, open('Output.txt', 'w') as o:
    for numline, line in enumerate((line.split() for line in f), start=1):
        pair = line[0], line[1]
        paircount[pair] += 1
        pairtime[pair] += float(line[2])
    #timeavg = pairtime[pair]/paircount[pair]
    #pairper = dict((pair, c * 100.0 / numline) for (pair, c) in paircount.iteritems())
    for pair, freq in paircount.iteritems():
        timeavg = pairtime[pair] / freq
        med = np.median(np.pairtime[pair])
        #print pair[0], pair[1], c, pairper[pair], pairtime[pair]

        o.write("%s %s %s %.2f %.2f %s \n" % (pair[0], pair[1], freq, pairtime[pair], timeavg, med))
print 'done'

Error:

 Traceback (most recent call last):
  File "pair_one.py", line 20, in <module>
    med = np.median(np.pairtime[pair])
AttributeError: 'module' object has no attribute 'pairtime'

Your error does not really have anything to do with the median, so this post should have a different title!

When Python says need more than 2 values to unpack, look at the line it is complaining about. Your loop asks for med, pair, freq, in other words three values at a time, while what you are giving it is the result of iteritems(). iteritems() always gives you two values at a time, since it yields (key, val) pairs.

I think you just need to remove med, from your for-loop.
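A minimal sketch of that unpacking error, written for Python 3 (where items() replaces Python 2's iteritems() and the message wording differs slightly). The dictionary contents and the helper name try_unpack_three are made up for illustration:

```python
# Hypothetical data: one (col0, col1) pair mapped to its count.
paircount = {("00904bcabb02", "00904bf7d758"): 1}

def try_unpack_three(d):
    """Try to unpack each (key, value) item into three names."""
    try:
        for med, pair, freq in d.items():  # items are 2-tuples, so this fails
            pass
    except ValueError as exc:
        return str(exc)
    return None

error_message = try_unpack_three(paircount)

# Unpacking into two names matches what items() actually yields.
unpacked = [(pair, freq) for pair, freq in paircount.items()]
```

With two names on the left-hand side of the loop, each (key, val) item unpacks cleanly.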

Your main problem is that you are passing a single floating-point number into the median function (pairtime[pair] contains the sum of the third-column values for the given c1, c2 pair). You need to pass the list of values instead. The way to calculate a median is:

1) Take a list of numbers

2) Sort it

3) Pluck out the number in the exact center of the list. This is the median. (If the list has an even number of entries, take the average of the two middle numbers, which is what numpy.median does.)
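The three steps above can be sketched in plain Python as follows. The function name median and the sample numbers are only for illustration, and the even-length branch averages the two middle values:

```python
def median(values):
    """Return the median of a non-empty sequence of numbers."""
    ordered = sorted(values)          # step 2: sort a copy of the input
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty sequence is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # odd length: the single middle value
    # even length: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2.0

odd_result = median([676.0, 120.0, 572.0])
even_result = median([2.0, 20.0, 8.0, 33.0])
```

Note that the function needs the whole list of values; a running sum is not enough to recover the middle element.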

Here's my crack at a rewrite. I have not run it, so there may be syntax issues, but it should give you the general idea.

from collections import defaultdict
import numpy as np
paircount = defaultdict(int)
pairtime = defaultdict(float)
pairtimelist = defaultdict(list)
pairper = defaultdict(float)
timeavg = defaultdict(float)
timefreq = defaultdict(int)

#get number of pair occurrences and total time
with open('Input.txt', 'r') as f, open('Output.txt', 'w') as o:
    for numline, line in enumerate((line.split() for line in f), start=1):
        pair = line[0], line[1]
        paircount[pair] += 1
        pairtime[pair] += float(line[2])
        pairtimelist[pair].append(float(line[2]))  # store the raw value, not the running sum
    #timeavg = pairtime[pair]/paircount[pair]
    #pairper = dict((pair, c * 100.0 / numline) for (pair, c) in paircount.iteritems())
    for pair, freq in paircount.iteritems():
        timeavg = pairtime[pair] / freq
        med = np.median(pairtimelist[pair])
        #print pair[0], pair[1], c, pairper[pair], pairtime[pair]

        o.write("%s %s %s %.2f %.2f %s \n" % (pair[0], pair[1], freq, pairtime[pair], timeavg, med))
print 'done'

The median is the middle value of a sorted array. Perhaps you mean this?

# Median of the per-pair totals (one summed time per pair),
# not of the individual values within each pair
timelist = []
for pair in paircount:
    timelist.append(pairtime[pair])
timeArr = np.array(timelist)
median = np.median(timeArr)
print median

Replace:

med = np.median(np.pairtime[pair])

with:

med = np.median(pairtime[pair])

pairtime is a variable defined in your own script, not an attribute of the numpy module.
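The same AttributeError appears with any module, which may make the cause easier to see. Here is a small sketch that uses the standard-library math module as a stand-in for numpy; the dictionary contents are made up:

```python
import math

# pairtime lives in this script's own namespace, not inside any module.
pairtime = {("a", "b"): 676.0}

try:
    math.pairtime          # looks for an attribute named pairtime on the module
    lookup_failed = False
except AttributeError:
    lookup_failed = True

value = pairtime[("a", "b")]   # the plain name finds the script's own variable
```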

EDIT

As @Fred S has pointed out, pairtime[pair] contains only the sum of the times, not the complete series. I didn't notice that before. Since you will calculate several statistics from the time series, I believe a better approach is to keep the whole series instead of just the sum, as @Fred S did in his answer. Then you can calculate all your statistics on the series.

Here is a shot at a possible solution:

from collections import defaultdict
import numpy as np
pairtimelist = defaultdict(list)

with open('Input.txt', 'r') as f, open('Output.txt', 'w') as o:
    for numline, line in enumerate((line.split() for line in f), start=1):
        pair = line[0], line[1]
        pairtimelist[pair].append(float(line[2]))
    for pair in pairtimelist.iterkeys():
        timeavg = np.mean(pairtimelist[pair])
        timemed = np.median(pairtimelist[pair])
        timesum = np.sum(pairtimelist[pair])
        freq = len(pairtimelist[pair])

        o.write("%s %s %s %.2f %.2f %s \n" % (pair[0], pair[1], freq, timesum, timeavg, timemed))
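To try the idea without the input file, here is a rough Python 3 sketch on a few in-memory rows. The first row is from the question; the others are invented so that one pair occurs more than once. It uses the standard-library statistics module instead of numpy, purely to keep the example dependency-free, and summarize is a hypothetical helper name:

```python
from collections import defaultdict
from statistics import mean, median

# Illustrative rows: the first is from the question, the rest are invented
# so that one pair has several values.
rows = [
    "00904bcabb02 00904bf7d758 676.0",
    "00904bcabb02 00904bf7d758 120.0",
    "00904bcabb02 00904bf7d758 572.0",
    "00022d1aa531 0006254f5810 2.0",
]

def summarize(lines):
    """Map each (col0, col1) pair to (freq, total, average, median)."""
    times = defaultdict(list)
    for line in lines:
        first, second, value = line.split()
        times[(first, second)].append(float(value))  # keep every raw value
    return {
        pair: (len(vals), sum(vals), mean(vals), median(vals))
        for pair, vals in times.items()
    }

stats = summarize(rows)
for (first, second), (freq, total, avg, med) in stats.items():
    print("%s %s %d %.2f %.2f %.2f" % (first, second, freq, total, avg, med))
```

Because every raw value is kept, the frequency, sum, average and median all come from the same list, and no separate counters are needed.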

The reason for the error is that you are prefixing pairtime with np, and NumPy has no idea what pairtime is. If the intention is to convert pairtime to a NumPy array, you should write np.array(pairtime). This should work, syntax-wise:

from collections import defaultdict
import numpy as np
paircount = defaultdict(int)
pairtime = defaultdict(float)
pairper = defaultdict(float)
timeavg = defaultdict(float)
timefreq = defaultdict(int)

#get number of pair occurrences and total time
with open('Input.txt', 'r') as f, open('Output.txt', 'w') as o:
    for numline, line in enumerate((line.split() for line in f), start=1):
        pair = line[0], line[1]
        paircount[pair] += 1
        pairtime[pair] += float(line[2])
    #timeavg = pairtime[pair]/paircount[pair]
    #pairper = dict((pair, c * 100.0 / numline) for (pair, c) in paircount.iteritems())
    for pair, freq in paircount.iteritems():
        timeavg = pairtime[pair] / freq
        med = np.median(np.array(pairtime[pair]))
        # med = np.median(pairtime[pair]) # should work as well without np.array
        #print pair[0], pair[1], c, pairper[pair], pairtime[pair]

        o.write("%s %s %s %.2f %.2f %s \n" % (pair[0], pair[1], freq, pairtime[pair], timeavg, med))
print 'done'
