Resample a pandas DataFrame and apply mode

Question

I would like to calculate the mode for each group of resampled rows in a pandas DataFrame. I tried it like this:

import datetime
import pandas as pd
import numpy as np
from statistics import mode


date_times = pd.date_range(datetime.datetime(2012, 4, 5),
                           datetime.datetime(2013, 4, 5),
                           freq='D')
a = np.random.sample(date_times.size) * 10.0

frame = pd.DataFrame(data={'a': a},
                     index=date_times)

frame['b'] = np.random.randint(1, 3, frame.shape[0])
frame.resample("M").apply({'a':'sum', 'b':'mode'})

But it doesn't work.

I also tried:

frame.resample("M").apply({'a':'sum', 'b':lambda x: mode(frame['b'])})

But I get wrong results. Any ideas?

Thanks.

Answer 1

In frame.resample("M").apply({'a':'sum', 'b':lambda x: mode(frame['b'])}) the lambda function is called once for each resampling group. x is assigned to a Series whose values are from the b column of that resampling group.

lambda x: mode(frame['b']) ignores x and simply returns the mode of frame['b'], that is, of the entire column, so every group gets the same value.
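To make the difference concrete, here is a minimal sketch on a tiny hand-made frame (the data and the names month_end, whole and per_group are made up for illustration). It uses len instead of mode so that ties cannot get in the way, and it passes a pd.offsets.MonthEnd() offset object rather than the "M" string, which I believe recent pandas versions have renamed to "ME":

```python
import pandas as pd

# Four days: two at the end of April, two at the start of May.
idx = pd.date_range("2012-04-29", periods=4, freq="D")
frame = pd.DataFrame({"b": [1, 1, 2, 2]}, index=idx)

month_end = pd.offsets.MonthEnd()  # month-end bins, like the "M" rule in the question

# Ignores x: every group reports the length of the whole column.
whole = frame["b"].resample(month_end).apply(lambda x: len(frame["b"]))

# Uses x: every group reports its own length.
per_group = frame["b"].resample(month_end).apply(lambda x: len(x))

print(whole.tolist())
print(per_group.tolist())
```

The first list repeats the size of the whole column for each month, while the second gives the number of rows that actually fall into each month.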

Instead, you would want something like

frame.resample("M").apply({'a':'sum', 'b':lambda x: mode(x)})

However, this leads to a StatisticsError

StatisticsError: no unique mode; found 2 equally common values

since there is a resampling group with more than one most common value.
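Note that this error depends on the Python version: on Python 3.8 and later, statistics.mode no longer raises on ties but returns the first mode it encounters. If you are on such a version, statistics.multimode lets you see all tied values and pick one explicitly. A small sketch with made-up values:

```python
from statistics import multimode  # available from Python 3.8

values = [2, 1, 2, 1, 3]   # 1 and 2 both occur twice

modes = multimode(values)  # all most-common values, in order of first occurrence
smallest = min(modes)
largest = max(modes)

print(modes, smallest, largest)
```

Taking min or max of the list makes the tie-breaking rule explicit instead of leaving it to the order of the data.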

If you use scipy.stats.mode instead, then the smallest such most-common value is returned:

import datetime
import pandas as pd
import numpy as np
import scipy.stats as stats

date_times = pd.date_range(datetime.datetime(2012, 4, 5),
                           datetime.datetime(2013, 4, 5),
                           freq='D')
a = np.random.sample(date_times.size) * 10.0
frame = pd.DataFrame(data={'a': a}, index=date_times)
frame['b'] = np.random.randint(1, 3, frame.shape[0])

result = frame.resample("M").apply({'a':'sum', 'b':lambda x: stats.mode(x)[0]})
print(result)

yields

            b           a
2012-04-30  2  132.708704
2012-05-31  2  149.103439
2012-06-30  2  128.492203
2012-07-31  2  142.167672
2012-08-31  2  126.516689
2012-09-30  1  133.209314
2012-10-31  2  136.684212
2012-11-30  2  165.075150
2012-12-31  2  167.064212
2013-01-31  1  150.293293
2013-02-28  1  125.533830
2013-03-31  2  174.236113
2013-04-30  2   11.254136
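If you would rather not depend on SciPy, pandas' own Series.mode might be enough. It returns all tied modes in sorted order, so taking the first element gives the smallest most-common value. This is a sketch on a small hand-made frame, not the random data above; the values and the name smallest_mode are my own, and pd.offsets.MonthEnd() stands in for the "M" rule:

```python
import pandas as pd

# Four days in April and four in May.
idx = pd.date_range("2012-04-27", periods=8, freq="D")
frame = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
     "b": [1, 2, 2, 1, 3, 3, 3, 1]},   # April has a tie between 1 and 2
    index=idx,
)

def smallest_mode(x):
    # Series.mode() returns the modes sorted ascending; iloc[0] is the smallest.
    return x.mode().iloc[0]

result = frame.resample(pd.offsets.MonthEnd()).agg({"a": "sum", "b": smallest_mode})
print(result)
```

One caveat: if a resampling bin can be empty, x.mode() is empty as well and iloc[0] would fail, so that case would need its own handling.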

If you want the largest most-common value, then, unfortunately, I don't know of any built-in function which does this for you. In this case you might have to compute a value_counts table:

In [89]: counts
Out[89]: 
            b  counts
2012-04-30  3      11
2012-04-30  2      10
2012-04-30  1       5
2012-05-31  2      14
2012-05-31  1       9
2012-05-31  3       8

Then sort it in descending order by both counts and b value, group by the date, and take the first value in each group:

import datetime as DT
import numpy as np
import scipy.stats as stats
import pandas as pd
np.random.seed(2018)

date_times = pd.date_range(DT.datetime(2012, 4, 5), DT.datetime(2013, 4, 5), freq='D')
N = date_times.size
a = np.random.sample(N) * 10.0
frame = pd.DataFrame(data={'a': a, 'b': np.random.randint(1, 4, N)}, index=date_times)

resampled = frame.resample("M")
sums = resampled['a'].sum()
counts = resampled['b'].value_counts()
counts.name = 'counts'
counts = counts.reset_index(level=1)
counts = counts.sort_values(by=['counts','b'], 
                             ascending=[False,False])
result = counts.groupby(level=0).first()
print(result)

yields

            b  counts
2012-04-30  3      11
2012-05-31  2      14
2012-06-30  3      12
2012-07-31  2      12
2012-08-31  2      11
2012-09-30  3      12
2012-10-31  2      13
2012-11-30  3      13
2012-12-31  2      14
2013-01-31  3      14
2013-02-28  1      10
2013-03-31  3      13
2013-04-30  3       2
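A shorter route to the largest most-common value, if you do not need the counts column, may be to take the maximum of Series.mode inside the aggregation. Again this is only a sketch on hand-made data; largest_mode is a hypothetical helper name, and pd.offsets.MonthEnd() is used in place of the "M" rule:

```python
import pandas as pd

# Four days in April and four in May.
idx = pd.date_range("2012-04-27", periods=8, freq="D")
frame = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
     "b": [1, 2, 2, 1, 3, 3, 3, 1]},   # April has a tie between 1 and 2
    index=idx,
)

def largest_mode(x):
    # Series.mode() returns every tied mode; max() picks the largest of them.
    return x.mode().max()

largest = frame.resample(pd.offsets.MonthEnd()).agg({"a": "sum", "b": largest_mode})
print(largest)
```

For April the tie between 1 and 2 resolves to 2, which is the behaviour the value_counts approach above produces as well.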

Answer 1 (accepted; score 2; 2018-01-25 17:07:49)