Python/Pandas for solving grouped mean, median, mode and standard deviation
I have the following data:
[4.1, 4.1, 4.1, 4.2, 4.3, 4.3, 4.4, 4.5, 4.6, 4.6, 4.8, 4.9, 5.1, 5.1, 5.2, 5.2, 5.3, 5.3, 5.3, 5.4, 5.4, 5.5, 5.6, 5.6, 5.6, 5.7, 5.8, 5.9, 6.2, 6.2, 6.2, 6.3, 6.4, 6.4, 6.5, 6.6, 6.7, 6.7, 6.8, 6.8]
I need to build a count/frequency table like this from the data above:
4.1 - 4.5: 8
4.6 - 5.0: 4
5.1 - 5.5: 10
5.6 - 6.0: 6
6.1 - 6.5: 7
6.6 - 7.0: 5
The closest I have come is:
            counts  freqs
categories
[4.1, 4.6)       8  0.200
[4.6, 5.1)       4  0.100
[5.1, 5.6)      10  0.250
[5.6, 6.1)       6  0.150
[6.1, 6.6)       7  0.175
[6.6, 7.1)       5  0.125
using this code:
import pandas as pd

sr = [4.1, 4.1, 4.1, 4.2, 4.3, 4.3, 4.4, 4.5, 4.6, 4.6, 4.8, 4.9, 5.1, 5.1, 5.2, 5.2, 5.3, 5.3, 5.3, 5.4, 5.4, 5.5, 5.6, 5.6, 5.6, 5.7, 5.8, 5.9, 6.2, 6.2, 6.2, 6.3, 6.4, 6.4, 6.5, 6.6, 6.7, 6.7, 6.8, 6.8]
# right=False makes every bin closed on the left and open on the right
ncut = pd.cut(sr, [4.1, 4.6, 5.1, 5.6, 6.1, 6.6, 7.1], right=False)
srpd = pd.DataFrame(ncut.describe())
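I need to create a new column holding the midpoint of each "categories" value. For example, "[4.1, 4.6)" holds the counts/frequencies of the data from 4.1 to 4.5 (4.6 excluded), so I need (4.1 + 4.5) / 2, which equals 4.3.
These are my questions:
1) How do I access the values under the "categories" index so that I can use them in the calculation above?
2) Is there a way to display the ranges like this: 4.1-4.5, 4.6-5.0, and so on?
3) Is there an easier way to compute the mean, median, mode, etc. of grouped data like this? Or do I have to write my own functions for these in Python?
Thanks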
For your bins and labels question, how about the following:
bins = [4.1, 4.6, 5.1, 5.6, 6.1, 6.6, 7.1]
labels = ['{}-{}'.format(x, y-.1) for x, y in zip(bins, bins[1:])]
Then, instead of keeping your values as a list, make them a Series:
sr = pd.Series([4.1, 4.1, 4.1, 4.2, 4.3, 4.3, 4.4, 4.5, 4.6, 4.6, 4.8, 4.9, 5.1,
5.1, 5.2, 5.2, 5.3, 5.3, 5.3, 5.4, 5.4, 5.5, 5.6, 5.6, 5.6, 5.7,
5.8, 5.9, 6.2, 6.2, 6.2, 6.3, 6.4, 6.4, 6.5, 6.6, 6.7, 6.7, 6.8, 6.8])
ncut = pd.cut(sr, bins=bins, labels=labels, right=False)
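As a quick check that the labelled bins reproduce the requested table, here is a minimal sketch. The names `values`, `binned` and `table` are mine, and the labels are formatted to one decimal place on the assumption that the data has a precision of 0.1.

```python
import pandas as pd

values = [4.1, 4.1, 4.1, 4.2, 4.3, 4.3, 4.4, 4.5, 4.6, 4.6, 4.8, 4.9, 5.1,
          5.1, 5.2, 5.2, 5.3, 5.3, 5.3, 5.4, 5.4, 5.5, 5.6, 5.6, 5.6, 5.7,
          5.8, 5.9, 6.2, 6.2, 6.2, 6.3, 6.4, 6.4, 6.5, 6.6, 6.7, 6.7, 6.8, 6.8]
bins = [4.1, 4.6, 5.1, 5.6, 6.1, 6.6, 7.1]

# Label each bin "lower-upper", where upper is one precision step (0.1, assumed)
# below the next edge; ".1f" hides any floating-point noise in the subtraction.
labels = [f"{lo:.1f}-{hi - 0.1:.1f}" for lo, hi in zip(bins, bins[1:])]

binned = pd.cut(pd.Series(values), bins=bins, labels=labels, right=False)

# Count the values per label, keep the bins in their natural order,
# then add the relative frequency (count divided by the total count).
table = binned.value_counts().sort_index().to_frame("count")
table["freq"] = table["count"] / table["count"].sum()
print(table)
```

Formatting the labels explicitly is a defensive choice: subtracting 0.1 from a float edge is not guaranteed to print cleanly for every set of edges.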
Define a lambda function to compute the relative frequency (the bin's count divided by the total number of values):
freq = lambda x: len(x) / len(sr)
freq.__name__ = 'freq'
Finally, use concat, groupby and agg to get the summary statistics for each bin:
pd.concat([ncut, sr], axis=1).groupby(0).agg(['size', 'std', 'mean', freq])
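For the first question (getting at the bin limits to compute a class midpoint), one possible approach is to build the bins as a `pd.IntervalIndex`, which exposes `left` and `right` directly. The sketch below is only an illustration: `precision`, `intervals` and `table` are names I introduced, and the upper class limit is assumed to be one precision step (0.1) below the next bin edge, as in the question.

```python
import numpy as np
import pandas as pd

values = [4.1, 4.1, 4.1, 4.2, 4.3, 4.3, 4.4, 4.5, 4.6, 4.6, 4.8, 4.9, 5.1,
          5.1, 5.2, 5.2, 5.3, 5.3, 5.3, 5.4, 5.4, 5.5, 5.6, 5.6, 5.6, 5.7,
          5.8, 5.9, 6.2, 6.2, 6.2, 6.3, 6.4, 6.4, 6.5, 6.6, 6.7, 6.7, 6.8, 6.8]
bins = [4.1, 4.6, 5.1, 5.6, 6.1, 6.6, 7.1]
precision = 0.1  # assumed measurement step of the data

# Left-closed intervals [4.1, 4.6), [4.6, 5.1), ... built from the bin edges
intervals = pd.IntervalIndex.from_breaks(bins, closed="left")
binned = pd.cut(pd.Series(values), bins=intervals)
counts = binned.value_counts().sort_index()

# The bin limits are available as plain arrays on the IntervalIndex
lower = intervals.left.to_numpy()
upper = np.round(intervals.right.to_numpy() - precision, 1)

table = pd.DataFrame({
    "lower": lower,
    "upper": upper,
    "mid": np.round((lower + upper) / 2, 2),  # class midpoint, e.g. (4.1 + 4.5) / 2
    "count": counts.to_numpy(),
})

# Grouped mean: every value in a class is represented by the class midpoint
grouped_mean = (table["mid"] * table["count"]).sum() / table["count"].sum()
print(table)
print(grouped_mean)
```

Note that `intervals.mid` also exists, but it is the midpoint of the bin edges (4.35 for the first bin), not the (4.1 + 4.5) / 2 = 4.3 asked for in the question.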
Let's try:
import pandas as pd

l = [4.1, 4.1, 4.1, 4.2, 4.3, 4.3, 4.4, 4.5, 4.6, 4.6, 4.8, 4.9,
     5.1, 5.1, 5.2, 5.2, 5.3, 5.3, 5.3, 5.4, 5.4, 5.5, 5.6, 5.6,
     5.6, 5.7, 5.8, 5.9, 6.2, 6.2, 6.2, 6.3, 6.4, 6.4, 6.5, 6.6,
     6.7, 6.7, 6.8, 6.8]
s = pd.Series(l)
bins = [4.1, 4.6, 5.1, 5.6, 6.1, 6.6, 7.1]
# f-strings need Python 3.6+
labels = [f'{i}-{j-.1}' for i, j in zip(bins, bins[1:])]
# The concatenated frame has columns 0 (bin label) and 1 (raw value)
(pd.concat([pd.cut(s, bins=bins, labels=labels, right=False), s], axis=1)
   .groupby(0)[1]
   .agg(['mean', 'median', pd.Series.mode, 'std'])
   .rename_axis('categories')
   .reset_index())
Output:
  categories      mean  median        mode       std
0    4.1-4.5  4.250000    4.25         4.1  0.151186
1    4.6-5.0  4.725000    4.70         4.6  0.150000
2    5.1-5.5  5.280000    5.30         5.3  0.131656
3    5.6-6.0  5.700000    5.65         5.6  0.126491
4    6.1-6.5  6.314286    6.30         6.2  0.121499
5    6.6-7.0  6.720000    6.70  [6.7, 6.8]  0.083666
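The table above summarises the raw values that fall into each bin. If only the frequency table is available, the usual textbook grouped-data estimates can be sketched as below. This is an illustration rather than a pandas feature: the variable names are mine, and the class boundary is assumed to sit half a precision step (0.05) below each lower class limit, which is one common convention.

```python
import numpy as np

# Frequency table from the question: lower class limits and counts
lower = np.array([4.1, 4.6, 5.1, 5.6, 6.1, 6.6])
counts = np.array([8, 4, 10, 6, 7, 5])
width = 0.5       # class width
precision = 0.1   # assumed measurement step of the data

boundaries = lower - precision / 2             # lower class boundaries: 4.05, 4.55, ...
mids = lower + (width - precision) / 2         # class midpoints: 4.3, 4.8, ...
n = counts.sum()

# Grouped mean and sample standard deviation, using midpoints as class values
mean = (counts * mids).sum() / n
std = np.sqrt((counts * (mids - mean) ** 2).sum() / (n - 1))

# Grouped median: L + ((n/2 - CF) / f) * w, where the median class is the
# first class whose cumulative count reaches n/2 and CF is the count before it
cum = counts.cumsum()
m = int(np.searchsorted(cum, n / 2))
cf_before = cum[m - 1] if m > 0 else 0
median = boundaries[m] + (n / 2 - cf_before) / counts[m] * width

# Grouped mode: L + (d1 / (d1 + d2)) * w for the class with the highest count.
# This assumes a single modal class whose count differs from a neighbour's.
k = int(counts.argmax())
d1 = counts[k] - (counts[k - 1] if k > 0 else 0)
d2 = counts[k] - (counts[k + 1] if k < len(counts) - 1 else 0)
mode = boundaries[k] + d1 / (d1 + d2) * width

print(mean, median, mode, std)
```

These are estimates from the binned counts only, so they will generally differ from statistics computed on the raw values.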
I sort of figured out a way to do this:
import pandas as pd

def buildFreqTable(data, width, numclass, pw):
    # Work on a sorted copy so the caller's list is left untouched
    data = sorted(data)
    minrange = []
    maxrange = []
    x_med = []
    count = []
    # Since data is sorted, take the lowest value to jumpstart the creation of ranges
    f_data = data[0]
    for i in range(numclass):
        # minrange holds the minimum value for that row
        minrange.append(f_data)
        # maxrange holds the maximum value for that row
        maxrange.append(f_data + (width - pw))
        # Compute the midpoint of the range
        minmax_median = (minrange[i] + maxrange[i]) / 2
        x_med.append(minmax_median)
        # Initialize count per class to 0, this will be incremented later
        count.append(0)
        f_data = f_data + width
    # Tally the frequencies (loop over numclass rather than a hard-coded 6)
    for x in data:
        for i in range(numclass):
            if minrange[i] <= x <= maxrange[i]:
                count[i] = count[i] + 1
    # Now, create the pandas dataframe for easier manipulation
    freqtable = pd.DataFrame()
    freqtable['minrange'] = minrange
    freqtable['maxrange'] = maxrange
    freqtable['x'] = x_med
    freqtable['count'] = count
    # Return the table so the caller can use it
    return freqtable

print(buildFreqTable(sr, 0.5, 6, 0.1))
It outputs the following:
   minrange  maxrange    x  count
0       4.1       4.5  4.3      8
1       4.6       5.0  4.8      4
2       5.1       5.5  5.3     10
3       5.6       6.0  5.8      6
4       6.1       6.5  6.3      7
5       6.6       7.0  6.8      5
Still, I'm curious whether there is a simpler way to do this, or whether someone could refactor my code to be more "pro".
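One possible refactoring replaces the explicit loops with NumPy array operations and `np.histogram`. This is only a sketch under a few assumptions: `build_freq_table` is a hypothetical name, the parameters mirror those of `buildFreqTable`, and the class limits are rounded to 10 decimal places to guard against floating-point drift.

```python
import numpy as np
import pandas as pd

def build_freq_table(data, width, numclass, precision):
    """Hypothetical vectorised variant of buildFreqTable (same output columns)."""
    data = np.asarray(data, dtype=float)
    # Lower and upper class limits for every class at once
    lower = np.round(data.min() + width * np.arange(numclass), 10)
    upper = np.round(lower + width - precision, 10)
    # numclass + 1 bin edges; each bin is [edge, next edge), except the last
    # one, which np.histogram also closes on the right
    edges = np.append(lower, np.round(lower[-1] + width, 10))
    counts, _ = np.histogram(data, bins=edges)
    return pd.DataFrame({
        "minrange": lower,
        "maxrange": upper,
        "x": np.round((lower + upper) / 2, 10),  # class midpoint
        "count": counts,
    })

values = [4.1, 4.1, 4.1, 4.2, 4.3, 4.3, 4.4, 4.5, 4.6, 4.6, 4.8, 4.9, 5.1,
          5.1, 5.2, 5.2, 5.3, 5.3, 5.3, 5.4, 5.4, 5.5, 5.6, 5.6, 5.6, 5.7,
          5.8, 5.9, 6.2, 6.2, 6.2, 6.3, 6.4, 6.4, 6.5, 6.6, 6.7, 6.7, 6.8, 6.8]
table = build_freq_table(values, 0.5, 6, 0.1)
print(table)
```

Values above the last edge are silently left out by `np.histogram`, so `numclass` still has to be large enough to cover the data, just as in the loop version.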