Find the median from a dictionary of values and the number of their occurrences?
I have a dictionary that looks like this (though much larger):
data = {
    100: 8,
    110: 2,
    1000: 4,
    2200: 3,
    4000: 1,
    11000: 1,
}
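Each pair is value: number of occurrences in my data set. I need to calculate the median of the data set. Any hints or ideas on how to do it?
I am using Python 3.6.
Edit:
I do not want to create a list (because of the size of my data set). The size of the list was actually the reason for using a dictionary instead. So I am looking for another way.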
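I believe this solution works as well, at least for positive numbers. I tested several data sets alongside your answer, and as far as I can tell they behave the same way.
(sorted_dict is the dictionary sorted by its keys)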
def find_median(sorted_dict):
    keys = list(sorted_dict.keys())
    counts = list(sorted_dict.values())
    length = sum(counts)
    half = length / 2
    sum_var = 0
    # Find the index of the first key past the middle of the dataset.
    # enumerate() replaces list.index(val), which returns the wrong
    # position when two keys share the same count.
    index = len(counts) - 1
    for i, val in enumerate(counts):
        if half - sum_var > 0:
            sum_var += val
        else:
            index = i
            break
    # Pick the median by comparing the counts before and from that index
    below = sum(counts[:index])
    above = sum(counts[index:])
    if above != below:
        if above > below:
            median = keys[index]
        else:
            median = keys[index - 1]
    else:
        median = (keys[index - 1] + keys[index]) / 2
    return median
This will work on Python 3.6+ as long as your dict is ordered by key.
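from math import floor, ceil

def find_weighted_median(d):
    # 0-based positions of the middle element(s) in the expanded, sorted data
    median_location = (sum(d.values()) - 1) / 2
    lower_location = floor(median_location)
    upper_location = ceil(median_location)
    lower = None
    upper = None
    running_total = 0
    for val, count in d.items():
        # val occupies positions running_total .. running_total + count - 1
        if lower is None and running_total <= lower_location < running_total + count:
            lower = val
        if upper is None and running_total <= upper_location < running_total + count:
            upper = val
        # "is not None" rather than truthiness, so a value of 0 is accepted
        if lower is not None and upper is not None:
            return (lower + upper) / 2
        running_total += count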
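The function above walks `d.items()` in iteration order, so it relies on the keys already being in ascending order. A minimal sketch of one way to arrange that, assuming the data starts out unordered (the shuffled `data` below is a hypothetical input built from the question's sample values):

```python
# Hypothetical unsorted input; the pairs are the question's sample data.
data = {2200: 3, 100: 8, 11000: 1, 110: 2, 4000: 1, 1000: 4}

# sorted() orders the (value, count) pairs by value; dict() then keeps that
# order, because dicts preserve insertion order (guaranteed from Python 3.7,
# an implementation detail of CPython 3.6).
sorted_dict = dict(sorted(data.items()))

print(list(sorted_dict))  # [100, 110, 1000, 2200, 4000, 11000]
```

Sorting costs time proportional to the number of distinct values, not the number of records, so the expanded list is still never built.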
So, having found no satisfactory answer, this is what I came up with:
from collections import OrderedDict
import statistics

d = {
    100: 8,
    110: 2,
    1000: 4,
    2200: 3,
    4000: 1,
    11000: 1,
}

# Sort the dictionary by its keys (the data values)
values_sorted = OrderedDict(sorted(d.items(), key=lambda t: t[0]))
index = sum(values_sorted.values()) / 2

# Decide whether the number of records is an even or odd number
even = index.is_integer()

# x stays True until the lower of two middle values has been found
x = True
# Compute median
for value, occurrences in values_sorted.items():
    index -= occurrences
    if index < 0 and x is True:
        median_manual = value
        break
    elif index == 0 and even is True:
        median_manual = value / 2
        x = False
    elif index < 0 and x is False:
        median_manual += value / 2
        break

# Create a list of all records and compute median using statistics package
values_list = list()
for val, count in d.items():
    values_list.extend([val] * count)
median_computed = statistics.median(values_list)

# Test the two results are equal
if median_manual != median_computed:
    raise RuntimeError
I tested it with different data sets and compared the results with the median computed by statistics.median(); the results were the same.
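That kind of comparison can be automated. The sketch below is one possible harness, not the test the answer actually ran: `check` and `median_from_counts` are hypothetical names, and the random inputs are kept small so that expanding them for `statistics.median()` stays cheap. Negative values and zero are included on purpose.

```python
import random
import statistics
from bisect import bisect_left
from collections import Counter
from itertools import accumulate


def median_from_counts(counts):
    """Median of a {value: occurrences} mapping without expanding it."""
    keys = sorted(counts)
    cumulative = list(accumulate(counts[k] for k in keys))
    total = cumulative[-1]
    # 1-based ranks of the middle element(s); identical when total is odd.
    low = keys[bisect_left(cumulative, (total + 1) // 2)]
    high = keys[bisect_left(cumulative, (total + 2) // 2)]
    return low if low == high else (low + high) / 2


def check(candidate, trials=200, seed=0):
    """Compare candidate against statistics.median on small random inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        values = rng.sample(range(-50, 50), rng.randint(1, 8))
        counts = {v: rng.randint(1, 5) for v in values}
        # Counter.elements() repeats each value by its count.
        expected = statistics.median(Counter(counts).elements())
        assert candidate(counts) == expected, counts
    return True


print(check(median_from_counts))
```

Any function that takes a `{value: occurrences}` mapping can be passed to `check` in place of `median_from_counts`.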
Below is a pandas-based solution.
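import pandas as pd

def getMed(item_dict: dict):
    'function finds median'
    df = pd.DataFrame.from_dict(item_dict, orient='index').reset_index()
    df.columns = ['values', 'count']
    df.sort_values('values', inplace=True)
    df['cum_sum'] = df['count'].cumsum()
    total_count = df.iloc[-1, -1]
    # 1-based ranks of the middle record(s); equal when total_count is odd
    lower_rank = (total_count + 1) // 2
    upper_rank = (total_count + 2) // 2
    # first value whose cumulative count reaches each rank
    lower = df.loc[df['cum_sum'] >= lower_rank, 'values'].iloc[0]
    upper = df.loc[df['cum_sum'] >= upper_rank, 'values'].iloc[0]
    return lower if lower == upper else (lower + upper) / 2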
The result for your input:
your_dict = {100: 8,
             110: 2,
             1000: 4,
             2200: 3,
             4000: 1,
             11000: 1
             }
getMed(your_dict)
>> 110
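To see where 110 comes from, the cumulative counts for this input can be reproduced with the standard library alone. This is only an illustrative sketch for an odd number of records; the names `values`, `cum_sum` and `rank` are chosen here and are not part of the answer.

```python
from itertools import accumulate

data = {100: 8, 110: 2, 1000: 4, 2200: 3, 4000: 1, 11000: 1}

values = sorted(data)
# Running total of occurrences, in ascending order of value.
cum_sum = list(accumulate(data[v] for v in values))
print(cum_sum)  # [8, 10, 14, 17, 18, 19]

# With 19 records the median is the 10th one in sorted order.
rank = (cum_sum[-1] + 1) // 2
# First value whose cumulative count reaches that rank.
median = next(v for v, c in zip(values, cum_sum) if c >= rank)
print(rank, median)  # 10 110
```

The first eight records are 100 and records nine and ten are 110, so the tenth record, the median, is 110.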
Here is my take:
data = {
    100: 8,
    110: 2,
    1000: 4,
    2200: 3,
    4000: 1,
    11000: 2,
}
total_frequency = sum(data.values())  # 1
middles = (total_frequency+1)//2, (total_frequency+2)//2  # 2
cumulated, first, second = 0, None, None
for key, frequency in sorted(data.items()):  # 3
    cumulated += frequency  # 3
    # "is None" rather than "not", so that a key of 0 is handled correctly
    if first is None and cumulated >= middles[0]:  # 4
        first = key
    if second is None and cumulated >= middles[1]:  # 4
        second = key
median = (first+second)/2  # 5
print(f'''
Middle Frequencies: {middles[0]},{middles[1]}
Middle Values: {first},{second}
Median: {median}
''')
The steps are:
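Read off the numbered comments in the code above, they are: (1) total the frequencies, (2) work out the one or two middle positions, (3) accumulate the frequencies in key order, (4) record the key at which the running total reaches each middle position, and (5) average the two keys. The same walk can be sketched as a reusable function. The name `median_by_steps`, the explicit sort, the early `break` and the empty-input check are assumptions added here, not part of the answer.

```python
def median_by_steps(data):
    """Median of a {value: frequency} mapping, following steps 1 to 5."""
    total_frequency = sum(data.values())                   # 1
    if total_frequency == 0:
        raise ValueError("no data")                        # assumption: reject empty input
    middles = ((total_frequency + 1) // 2,
               (total_frequency + 2) // 2)                 # 2
    cumulated, first, second = 0, None, None
    for key in sorted(data):                               # 3 (sorting is an assumption)
        cumulated += data[key]                             # 3
        if first is None and cumulated >= middles[0]:      # 4
            first = key
        if second is None and cumulated >= middles[1]:     # 4
            second = key
            break  # both middle values are known
    return (first + second) / 2                            # 5


data = {100: 8, 110: 2, 1000: 4, 2200: 3, 4000: 1, 11000: 2}
print(median_by_steps(data))          # 555.0
print(median_by_steps({0: 3, 5: 1}))  # 0.0
```

With 20 records the middle positions are 10 and 11, which fall on 110 and 1000, hence 555.0. The second call shows why the checks use `is None`: a key of 0 would otherwise be treated as "not found yet".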