Resampling of categorical column in pandas data frame
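I need some help with this problem. I have been trying a few things, but nothing has worked. I have a pandas DataFrame like the one shown below (at the end). The data arrives irregularly (the frequency is not fixed). I want to sample the data at a fixed frequency, for example once every minute. If the column were a float, taking the mean every minute would work fine: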
df1.resample('1T',base = 1).mean()
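But since the data is categorical, the mean makes no sense. I also tried sum, which makes no sense for the sampling either. What I basically need is the value with the highest count in the column within each 1-minute sample. To get that, I use the following code to apply a custom function to the values within each minute when resampling.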
def custome_mod(arraylike):
    vals, counts = np.unique(arraylike, return_counts=True)
    return (np.argwhere(counts == np.max(counts)))

df1.resample('1T',base = 1).apply(custome_mod)
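The output I expect is a DataFrame with one row per minute, holding the value with the highest count within that minute. For some reason it does not work and gives me an error. I have been trying to debug it for a long time. Could someone provide some input or check the code?
The error I get is as follows: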
ValueError: zero-size array to reduction operation maximum which has no identity
ValueError Traceback (most recent call last)
/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/generic.py in aggregate(self, func, *args, **kwargs)
264 try:
--> 265 return self._python_agg_general(func, *args, **kwargs)
266 except (ValueError, KeyError):
/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _python_agg_general(self, func, *args, **kwargs)
935
--> 936 result, counts = self.grouper.agg_series(obj, f)
937 assert result is not None
/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/ops.py in agg_series(self, obj, func)
862 grouper = libreduction.SeriesBinGrouper(obj, func, self.bins, dummy)
--> 863 return grouper.get_result()
864
pandas/_libs/reduction.pyx in pandas._libs.reduction.SeriesBinGrouper.get_result()
pandas/_libs/reduction.pyx in pandas._libs.reduction._BaseGrouper._apply_to_group()
pandas/_libs/reduction.pyx in pandas._libs.reduction._check_result_array()
ValueError: Function does not reduce
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
/databricks/python/lib/python3.7/site-packages/pandas/core/resample.py in _groupby_and_aggregate(self, how, grouper, *args, **kwargs)
358 # Check if the function is reducing or not.
--> 359 result = grouped._aggregate_item_by_item(how, *args, **kwargs)
360 else:
/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/generic.py in _aggregate_item_by_item(self, func, *args, **kwargs)
1171 try:
-> 1172 result[item] = colg.aggregate(func, *args, **kwargs)
1173
/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/generic.py in aggregate(self, func, *args, **kwargs)
268 # see see test_groupby.test_basic
--> 269 result = self._aggregate_named(func, *args, **kwargs)
270
/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/generic.py in _aggregate_named(self, func, *args, **kwargs)
453 if isinstance(output, (Series, Index, np.ndarray)):
--> 454 raise ValueError("Must produce aggregated value")
455 result[name] = output
ValueError: Must produce aggregated value
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<command-36984414005459> in <module>
----> 1 df1.resample('1T',base = 1).apply(custome_mod)
/databricks/python/lib/python3.7/site-packages/pandas/core/resample.py in aggregate(self, func, *args, **kwargs)
283 how = func
284 grouper = None
--> 285 result = self._groupby_and_aggregate(how, grouper, *args, **kwargs)
286
287 result = self._apply_loffset(result)
/databricks/python/lib/python3.7/site-packages/pandas/core/resample.py in _groupby_and_aggregate(self, how, grouper, *args, **kwargs)
380 # we have a non-reducing function
381 # try to evaluate
--> 382 result = grouped.apply(how, *args, **kwargs)
383
384 result = self._apply_loffset(result)
/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in apply(self, func, *args, **kwargs)
733 with option_context("mode.chained_assignment", None):
734 try:
--> 735 result = self._python_apply_general(f)
736 except TypeError:
737 # gh-20949
/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _python_apply_general(self, f)
749
750 def _python_apply_general(self, f):
--> 751 keys, values, mutated = self.grouper.apply(f, self._selected_obj, self.axis)
752
753 return self._wrap_applied_output(
/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/ops.py in apply(self, f, data, axis)
204 # group might be modified
205 group_axes = group.axes
--> 206 res = f(group)
207 if not _is_indexed_like(res, group_axes):
208 mutated = True
<command-36984414005658> in custome_mod(arraylike)
1 def custome_mod(arraylike):
2 vals, counts = np.unique(arraylike, return_counts=True)
----> 3 return (np.argwhere(counts == np.max(counts)))
<__array_function__ internals> in amax(*args, **kwargs)
/databricks/python/lib/python3.7/site-packages/numpy/core/fromnumeric.py in amax(a, axis, out, keepdims, initial, where)
2666 """
2667 return _wrapreduction(a, np.maximum, 'max', axis, None, out,
-> 2668 keepdims=keepdims, initial=initial, where=where)
2669
2670
/databricks/python/lib/python3.7/site-packages/numpy/core/fromnumeric.py in _wrapreduction(obj, ufunc, method, axis, dtype, out, **kwargs)
88 return reduction(axis=axis, out=out, **passkwargs)
89
---> 90 return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
91
92
ValueError: zero-size array to reduction operation maximum which has no identity
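A likely reading of the traceback, offered as a sketch rather than a confirmed diagnosis: `resample` produces a bin for every minute between the first and last timestamp, including minutes that contain no rows, and `np.max` on the empty `counts` array of such a bin raises the final error. The earlier "Function does not reduce" and "Must produce aggregated value" errors are consistent with the function returning an array (`np.argwhere`) instead of a single value. The empty array below is a hypothetical stand-in for an empty bin.

```python
import numpy as np

# Hypothetical stand-in for a one-minute bin that contains no rows.
empty_bin = np.array([], dtype=int)

# np.unique on an empty input returns empty values and empty counts.
vals, counts = np.unique(empty_bin, return_counts=True)

try:
    np.max(counts)          # reducing an empty array has no identity value
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```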
Sample DataFrame and expected output
Sample df
6/3/2021 1:19:05 0
6/3/2021 1:19:15 1
6/3/2021 1:19:26 1
6/3/2021 1:19:38 1
6/3/2021 1:20:06 0
6/3/2021 1:20:16 0
6/3/2021 1:20:36 1
6/3/2021 1:21:09 1
6/3/2021 1:21:19 1
6/3/2021 1:21:45 0
6/4/2021 1:19:15 0
6/4/2021 1:19:25 0
6/4/2021 1:19:36 0
6/4/2021 1:19:48 1
6/4/2021 1:22:26 1
6/4/2021 1:22:36 0
6/4/2021 1:22:46 0
6/5/2021 2:20:19 0
6/5/2021 2:20:21 1
6/5/2021 2:20:40 0
Expected output
6/3/2021 1:19 1
6/3/2021 1:20 0
6/3/2021 1:21 1
6/4/2021 1:19 0
6/4/2021 1:22 0
6/5/2021 2:20 0
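Note that the original DataFrame has data at an irregular frequency (sometimes every 5 seconds, sometimes every 20 seconds, and so on). The expected output is also shown above: I need data for every minute (resampled to one-minute intervals instead of the original irregular seconds), and the categorical column should hold the most frequent value within that minute. For example, the original data has four data points in minute 19, and the most frequent value among them is 1. Likewise, minute 20 has three data points, and the most frequent value is 0. Minute 21 also has three data points, and the most frequent value is 1. Also, the data I am working with has 20 million rows. I hope this helps; this is an effort to reduce the dimensionality of the data.
After getting the expected output, I will group by the column and count. That count will be in minutes, so I will be able to know how long (in time) the column was 1.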
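The counting step described here could look roughly like the following. It is a minimal sketch that assumes the per-minute result is already available as a Series; the values are copied from the expected output in the question, and the names `per_minute`, `minutes_per_value` and `minutes_at_one` are made up for illustration.

```python
import pandas as pd

# Per-minute values copied from the expected output in the question.
per_minute = pd.Series(
    [1, 0, 1, 0, 0, 0],
    index=pd.to_datetime([
        "2021-06-03 01:19", "2021-06-03 01:20", "2021-06-03 01:21",
        "2021-06-04 01:19", "2021-06-04 01:22", "2021-06-05 02:20",
    ]),
    name="category",
)

# Each row stands for one minute, so the count per value is a duration in minutes.
minutes_per_value = per_minute.value_counts()

# Number of minutes during which the column was 1 (0 if the value never occurs).
minutes_at_one = int(minutes_per_value.get(1, 0))
print(minutes_at_one)
```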
Are you looking for something like this:
# I used 'D' instead of 'T'
>>> df.set_index(df.index.floor('D')).groupby(level=0).count()
category
2021-06-03 6
2021-06-04 2
2021-06-06 1
2021-06-08 1
2021-06-25 1
2021-06-29 6
2021-06-30 3
# OR
>>> df.set_index(df.index.floor('D')).groupby(level=0).sum()
category
2021-06-03 2
2021-06-04 0
2021-06-06 1
2021-06-08 1
2021-06-25 0
2021-06-29 3
2021-06-30 1
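The same floor-and-group idea might be adapted to the per-minute most frequent value the question asks for. This is a sketch under assumptions, not a tested solution for the full 20-million-row data: it uses a few rows from the sample data, floors to `"min"` instead of `'D'`, and introduces a hypothetical helper `most_frequent`. Grouping on the floored index only yields minutes that actually contain rows, so no empty bins reach the helper.

```python
import pandas as pd

# A few rows from the sample data in the question; the column name
# "category" is taken from the output shown in the answer.
df = pd.DataFrame(
    {"category": [0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1]},
    index=pd.to_datetime([
        "2021-06-03 01:19:05", "2021-06-03 01:19:15",
        "2021-06-03 01:19:26", "2021-06-03 01:19:38",
        "2021-06-03 01:20:06", "2021-06-03 01:20:16",
        "2021-06-03 01:20:36",
        "2021-06-04 01:19:15", "2021-06-04 01:19:25",
        "2021-06-04 01:19:36", "2021-06-04 01:19:48",
    ]),
)

def most_frequent(values):
    # Series.mode() returns all tied values in sorted order; taking the first
    # breaks ties toward the smallest value (an arbitrary choice).
    return values.mode().iloc[0]

# Group rows by the minute they fall in and reduce each group to one value.
per_minute = df.groupby(df.index.floor("min"))["category"].agg(most_frequent)
print(per_minute)
```

Applying a Python function per group may be slow on very large inputs; whether that matters here would need to be measured on the real data.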