
Python summary statistics from counts dictionary

I am trying to gather summary statistics to generate a boxplot.
I have a dictionary where the keys are the observed values (to be plotted on the y-axis) and the values are their counts in the data.

d = {16: 5, 
     21: 9, 
     44: 2, 
      2: 1}

Is there a way to compute statistics such as the median, Q1, Q3, etc. from the counts alone? I don't want to expand the dictionary into a list like [16, 16, 16, 16, 16, 21, 21, ...] and calculate from that, because I am trying to save a considerable amount of memory by not storing the individual observations.

EDIT
To be more concrete, given the input

d = {4: 2, 10: 1, 3: 2, 11: 1, 18: 1, 12: 1, 14: 1, 16: 2, 7: 1}

I would like something that outputs

{'q1': 4, 'q2': 10.5, 'q3': 15, 'max': 18, 'min': 3}

Here is an idea. I have not handled every case (e.g. when the median index is not a whole number), but since get_val pulls from a lazy generator and stops at the first match, it should be memory-efficient.

from collections import OrderedDict
from itertools import accumulate

d = {16: 5, 
     21: 9, 
     44: 4, 
      2: 2}

d = OrderedDict(sorted(d.items()))
size = sum(d.values())
idx = {'q1': size/4,
       'q2': size/2,
       'q3': size*3/4}

# {'q1': 5.0, 'q2': 10.0, 'q3': 15.0}

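def get_val(d, i):
    # Walk the sorted keys alongside their running totals and return
    # the first key whose cumulative count exceeds position i.
    return next(k for k, x in zip(d, accumulate(d.values())) if i < x)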

res = {k: get_val(d, v) for k, v in idx.items()}

# {'q1': 16, 'q2': 21, 'q3': 21}
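
To cover the case where a quartile falls between two observations, here is a minimal sketch that extends the cumulative-count idea. It assumes the "median of halves" convention (Q1 and Q3 are the medians of the lower and upper halves, with the middle observation excluded when the total count is odd), which appears to be what the expected output in the question uses; other quantile conventions will give different numbers. The names summary_from_counts, kth and median are made up for this example, and the input is assumed to be non-empty.

```python
from bisect import bisect_right
from itertools import accumulate


def summary_from_counts(counts):
    """Five-number summary from a {value: count} mapping, without expanding it."""
    # Ignore values that never occur, then sort the distinct values.
    keys = sorted(k for k, c in counts.items() if c > 0)
    # Running totals: cum[j] is the number of observations <= keys[j].
    cum = list(accumulate(counts[k] for k in keys))
    n = cum[-1]

    def kth(i):
        # Value at 0-based position i of the (virtual) sorted data.
        return keys[bisect_right(cum, i)]

    def median(lo, hi):
        # Median of the virtual slice [lo, hi) of the sorted data.
        m = hi - lo
        mid = lo + m // 2
        if m % 2:
            return kth(mid)
        return (kth(mid - 1) + kth(mid)) / 2

    half = n // 2
    return {
        'min': keys[0],
        'q1': median(0, half),       # lower half
        'q2': median(0, n),          # overall median
        'q3': median(n - half, n),   # upper half
        'max': keys[-1],
    }


d = {4: 2, 10: 1, 3: 2, 11: 1, 18: 1, 12: 1, 14: 1, 16: 2, 7: 1}
print(summary_from_counts(d))
# {'min': 3, 'q1': 4.0, 'q2': 10.5, 'q3': 15.0, 'max': 18}
```

Memory use grows with the number of distinct values rather than the number of observations, since only the sorted keys and their running totals are stored.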

