
Find the median from a dictionary of values and number of their occurrences?

I have a dictionary that looks like this (though much larger):

data = {
    100: 8,
    110: 2,
    1000: 4,
    2200: 3,
    4000: 1,
    11000: 1,
}

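Each pair has the form value: number of occurrences in my dataset. I need to compute the median of the dataset. Any hints or ideas on how to do it?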

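I am using Python 3.6.

Edit:

I don't want to create a list (because of the size of my dataset). The size of the list was actually the reason for using a dictionary instead. So I am looking for another way.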

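I believe this solution works as well, at least for positive numbers. I tested some datasets alongside your answer, and as far as I can tell both give the same results.

(sorted_dict is the dictionary sorted by its keys)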

    def median_from_sorted_dict(sorted_dict):
        keys = list(sorted_dict.keys())
        counts = list(sorted_dict.values())
        half = sum(counts) / 2
        sum_var = 0
        # find the index of the first bucket past the middle of the dataset;
        # if the loop never breaks, the middle lies in the last bucket
        index = len(counts) - 1
        for i, val in enumerate(counts):
            if half - sum_var > 0:
                sum_var += val
            else:
                index = i
                break
        # pick the median by comparing the counts on either side of index
        after, before = sum(counts[index:]), sum(counts[:index])
        if after != before:
            if after > before:
                median = keys[index]
            else:
                median = keys[index - 1]
        else:
            median = (keys[index - 1] + keys[index]) / 2
        return median
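
The snippet above assumes `sorted_dict` already exists. A minimal sketch of one way to build it from the question's data, relying on plain dicts keeping insertion order (guaranteed from Python 3.7, a CPython implementation detail in 3.6). The shuffled key order here is only for illustration:

```python
# Same pairs as in the question, deliberately out of order
data = {2200: 3, 100: 8, 11000: 1, 110: 2, 4000: 1, 1000: 4}

# sorted() orders the (value, count) pairs by value; dict() keeps that order
sorted_dict = dict(sorted(data.items()))

print(list(sorted_dict))  # keys are now ascending
```

Sorting touches only the distinct values, so it stays cheap even when the counts are huge.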

This works on Python 3.6+, where dicts keep insertion order, provided the dict was built with its keys in ascending order.

from math import floor, ceil

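def find_weighted_median(d):
    # d must be ordered by key; positions below are 0-based
    # fractional when the total count is even (two middle records)
    median_location = (sum(d.values()) - 1) / 2
    lower_location = floor(median_location)
    upper_location = ceil(median_location)
    lower = None
    upper = None
    running_total = 0
    for val, count in d.items():
        # this bucket covers positions running_total .. running_total + count - 1
        if lower is None and running_total <= lower_location < running_total + count:
            lower = val
        if running_total <= upper_location < running_total + count:
            upper = val
        # compare with None so that a value of 0 is handled correctly
        if lower is not None and upper is not None:
            return (lower + upper) / 2
        running_total += count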

So, not having found a satisfactory answer, this is what I came up with:

from collections import OrderedDict
import statistics

d = {
 100: 8,
 110: 2,
 1000: 4,
 2200: 3,
 4000: 1,
 11000: 1,
}

# Sort the dictionary by key
values_sorted = OrderedDict(sorted(d.items(), key=lambda t: t[0]))
index = sum(values_sorted.values())/2

# Decide whether the number of records is an even or odd number
even = index.is_integer()

# Becomes True once the lower of the two middle records (even case) is found
found_lower_middle = False

# Compute median
for value, occurrences in values_sorted.items():
    index -= occurrences
    if index < 0 and not found_lower_middle:
        median_manual = value
        break
    elif index == 0 and even:
        median_manual = value/2
        found_lower_middle = True
    elif index < 0 and found_lower_middle:
        median_manual += value/2
        break

# Create a list of all records and compute median using statistics package
values_list = list()
for val, count in d.items():
    values_list.extend([val] * count)

median_computed = statistics.median(values_list)

# Test the two results are equal
if median_manual != median_computed:
    raise RuntimeError

I tested it with different datasets and compared the results with the median computed by statistics.median(); the results were the same.
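
That kind of comparison can be automated. Below is a minimal sketch of a randomized cross-check, assuming small tables for which expanding into a list is affordable. `median_from_counts` is a hypothetical helper written for this illustration, not the code from the answer above, and it assumes a non-empty table with positive counts:

```python
import random
import statistics

def median_from_counts(counts):
    """Median of a {value: count} table without expanding it (hypothetical helper)."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("empty dataset")
    # 1-based ranks of the middle record(s); both are equal when total is odd
    lo_rank, hi_rank = (total + 1) // 2, (total + 2) // 2
    lo = hi = None
    seen = 0
    for value in sorted(counts):
        seen += counts[value]
        if lo is None and seen >= lo_rank:
            lo = value
        if seen >= hi_rank:
            hi = value
            break
    return (lo + hi) / 2

rng = random.Random(0)  # fixed seed so any failure is reproducible
for _ in range(1000):
    table = {rng.randint(-50, 50): rng.randint(1, 9)
             for _ in range(rng.randint(1, 8))}
    # expand the small table into a flat list only to get a reference answer
    expanded = [v for v, c in table.items() for _ in range(c)]
    assert median_from_counts(table) == statistics.median(expanded), table
```

Including negative values and zero among the keys exercises cases that truthiness checks such as `if not lower` would get wrong.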

Below is a pandas-based solution.

import pandas as pd

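def getMed(item_dict : dict[int, int]) -> float:
    'Return the median of a {value: count} table.'
    df = pd.DataFrame.from_dict(item_dict, orient='index').reset_index()
    df.columns = ['values', 'count']
    df.sort_values('values', inplace=True)
    df['cum_sum'] = df['count'].cumsum()
    total_count = df.iloc[-1, -1]
    # 1-based ranks of the middle record(s); both are the same when total_count is odd
    lo_rank, hi_rank = (total_count + 1) // 2, (total_count + 2) // 2
    # first value whose cumulative count reaches each rank
    lo = df.loc[df['cum_sum'] >= lo_rank, 'values'].iloc[0]
    hi = df.loc[df['cum_sum'] >= hi_rank, 'values'].iloc[0]
    return float(lo + hi) / 2

Result for your input:

your_dict = {100: 8,
             110: 2,
             1000: 4,
             2200: 3,
             4000: 1,
             11000: 1
            }

getMed(your_dict)
>> 110.0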

Here is my take:

data = {
    100: 8,
    110: 2,
    1000: 4,
    2200: 3,
    4000: 1,
    11000: 2,
}
total_frequency = sum(data.values())                        # 1
middles = (total_frequency+1)//2, (total_frequency+2)//2    # 2

cumulated, first, second = 0, None, None

for key, frequency in sorted(data.items()):                 # 3
    cumulated += frequency                                  # 3
    if first is None and cumulated >= middles[0]:           # 4
        first = key
    if second is None and cumulated >= middles[1]:          # 4
        second = key


median = (first+second)/2                                   # 5

print(f'''
Middle Frequencies: {middles[0]},{middles[1]}
Middle Values: {first},{second}
Median: {median}
''')

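The steps are:

  1. Compute the total frequency of the table, i.e. the sum of the values in the dictionary.
  2. Find the two middle positions. If the total is odd, they are the same.
  3. Iterate over the table in ascending key order and accumulate the frequencies.
  4. As soon as the cumulated frequency reaches one of the middle positions, store the key.
  5. The median is the mean of the two stored keys.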
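
The steps above can also be sketched as a reusable function. This variant swaps the manual loop for `itertools.accumulate` and `bisect.bisect_left` from the standard library. The name `median_of_table` is hypothetical, and the sketch assumes a non-empty table with positive frequencies:

```python
from bisect import bisect_left
from itertools import accumulate

def median_of_table(table):
    """Median of a {value: frequency} table (hypothetical helper name)."""
    keys = sorted(table)                                  # values in ascending order
    cumulated = list(accumulate(table[k] for k in keys))  # steps 1 and 3
    total = cumulated[-1]
    middles = (total + 1) // 2, (total + 2) // 2          # step 2, 1-based positions
    # step 4: first key whose cumulated frequency reaches each middle position
    first, second = (keys[bisect_left(cumulated, m)] for m in middles)
    return (first + second) / 2                           # step 5

print(median_of_table({100: 8, 110: 2, 1000: 4, 2200: 3, 4000: 1, 11000: 2}))  # 555.0
```

Memory use grows with the number of distinct values, not with the number of records, which keeps it in line with the question's constraint of not building the full list.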

