
Resampling of categorical column in pandas data frame

I need some help with this problem; I have been trying a few things but nothing has worked. I have a pandas data frame as shown below (at the end). The data arrives at irregular intervals (the frequency is not fixed), and I want to resample it at a fixed frequency, e.g. every 1 minute. If the column were float, taking the mean every 1 minute would work fine:

df1.resample('1T', base=1).mean()

But since the data is categorical, the mean is meaningless; I also tried sum, which makes no sense for this kind of sampling either. What I basically need is the value with the maximum count within each 1-minute bin. For that, I apply the following custom function to the values that fall within each 1-minute window when resampling:

    def custome_mod(arraylike):
        vals, counts = np.unique(arraylike, return_counts=True)
        return np.argwhere(counts == np.max(counts))

df1.resample('1T', base=1).apply(custome_mod)

The output I expect is a data frame with one row per minute, holding the value with the maximum count within that minute. For some reason it does not work and gives me an error, which I have been trying to debug for a long time. Could someone provide some input / check the code?

The error I get is as follows:

ValueError: zero-size array to reduction operation maximum which has no identity

ValueError                                Traceback (most recent call last)
/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/generic.py in aggregate(self, func, *args, **kwargs)
    264             try:
--> 265                 return self._python_agg_general(func, *args, **kwargs)
    266             except (ValueError, KeyError):

/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _python_agg_general(self, func, *args, **kwargs)
    935 
--> 936             result, counts = self.grouper.agg_series(obj, f)
    937             assert result is not None

/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/ops.py in agg_series(self, obj, func)
    862         grouper = libreduction.SeriesBinGrouper(obj, func, self.bins, dummy)
--> 863         return grouper.get_result()
    864 

pandas/_libs/reduction.pyx in pandas._libs.reduction.SeriesBinGrouper.get_result()

pandas/_libs/reduction.pyx in pandas._libs.reduction._BaseGrouper._apply_to_group()

pandas/_libs/reduction.pyx in pandas._libs.reduction._check_result_array()

ValueError: Function does not reduce

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
/databricks/python/lib/python3.7/site-packages/pandas/core/resample.py in _groupby_and_aggregate(self, how, grouper, *args, **kwargs)
    358                 # Check if the function is reducing or not.
--> 359                 result = grouped._aggregate_item_by_item(how, *args, **kwargs)
    360             else:

/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/generic.py in _aggregate_item_by_item(self, func, *args, **kwargs)
   1171             try:
-> 1172                 result[item] = colg.aggregate(func, *args, **kwargs)
   1173 

/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/generic.py in aggregate(self, func, *args, **kwargs)
    268                 #  see see test_groupby.test_basic
--> 269                 result = self._aggregate_named(func, *args, **kwargs)
    270 

/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/generic.py in _aggregate_named(self, func, *args, **kwargs)
    453             if isinstance(output, (Series, Index, np.ndarray)):
--> 454                 raise ValueError("Must produce aggregated value")
    455             result[name] = output

ValueError: Must produce aggregated value

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<command-36984414005459> in <module>
----> 1 df1.resample('1T',base = 1).apply(custome_mod)

/databricks/python/lib/python3.7/site-packages/pandas/core/resample.py in aggregate(self, func, *args, **kwargs)
    283             how = func
    284             grouper = None
--> 285             result = self._groupby_and_aggregate(how, grouper, *args, **kwargs)
    286 
    287         result = self._apply_loffset(result)

/databricks/python/lib/python3.7/site-packages/pandas/core/resample.py in _groupby_and_aggregate(self, how, grouper, *args, **kwargs)
    380             # we have a non-reducing function
    381             # try to evaluate
--> 382             result = grouped.apply(how, *args, **kwargs)
    383 
    384         result = self._apply_loffset(result)

/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in apply(self, func, *args, **kwargs)
    733         with option_context("mode.chained_assignment", None):
    734             try:
--> 735                 result = self._python_apply_general(f)
    736             except TypeError:
    737                 # gh-20949

/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _python_apply_general(self, f)
    749 
    750     def _python_apply_general(self, f):
--> 751         keys, values, mutated = self.grouper.apply(f, self._selected_obj, self.axis)
    752 
    753         return self._wrap_applied_output(

/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/ops.py in apply(self, f, data, axis)
    204             # group might be modified
    205             group_axes = group.axes
--> 206             res = f(group)
    207             if not _is_indexed_like(res, group_axes):
    208                 mutated = True

<command-36984414005658> in custome_mod(arraylike)
      1 def custome_mod(arraylike):
      2   vals, counts = np.unique(arraylike, return_counts=True)
----> 3   return (np.argwhere(counts == np.max(counts)))

<__array_function__ internals> in amax(*args, **kwargs)

/databricks/python/lib/python3.7/site-packages/numpy/core/fromnumeric.py in amax(a, axis, out, keepdims, initial, where)
   2666     """
   2667     return _wrapreduction(a, np.maximum, 'max', axis, None, out,
-> 2668                           keepdims=keepdims, initial=initial, where=where)
   2669 
   2670 

/databricks/python/lib/python3.7/site-packages/numpy/core/fromnumeric.py in _wrapreduction(obj, ufunc, method, axis, dtype, out, **kwargs)
     88                 return reduction(axis=axis, out=out, **passkwargs)
     89 
---> 90     return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
     91 
     92 

ValueError: zero-size array to reduction operation maximum which has no identity

Sample Dataframe and Expected Output

Sample df:

6/3/2021 1:19:05    0
6/3/2021 1:19:15    1
6/3/2021 1:19:26    1
6/3/2021 1:19:38    1
6/3/2021 1:20:06    0
6/3/2021 1:20:16    0
6/3/2021 1:20:36    1
6/3/2021 1:21:09    1
6/3/2021 1:21:19    1
6/3/2021 1:21:45    0
6/4/2021 1:19:15    0
6/4/2021 1:19:25    0
6/4/2021 1:19:36    0
6/4/2021 1:19:48    1
6/4/2021 1:22:26    1
6/4/2021 1:22:36    0
6/4/2021 1:22:46    0
6/5/2021 2:20:19    0
6/5/2021 2:20:21    1
6/5/2021 2:20:40    0

Expected output:

6/3/2021 1:19   1
6/3/2021 1:20   0
6/3/2021 1:21   1
6/4/2021 1:19   0
6/4/2021 1:22   0
6/5/2021 2:20   0

Note that the original data frame has data at irregular frequency (sometimes every 5 seconds, every 20 seconds, etc.). The expected output is shown above: I need the data every 1 minute (resampled to the minute instead of the original irregular seconds), and the categorical column should hold the most frequent value within that minute. For example, the original data has four data points in minute 1:19, where the most frequent value is 1; minute 1:20 has three data points with 0 as the most frequent value; and minute 1:21 has three data points where 1 is the most frequent. Also, the data I am working with has 20 million rows, so this is an effort to reduce its dimensionality. Hope that helps.

After getting the expected output, I will group by the column and count. That count will be in minutes, so I will be able to tell how long (in time) the column was 1.
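
Roughly what I have in mind for that last step (a sketch, assuming the resampled result is a Series named out with one categorical value per minute):

    # with one row per minute, counting rows per value gives the number
    # of minutes spent in each state
    minutes_per_value = out.value_counts()
    # e.g. minutes_per_value[1] = number of minutes the column was 1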

Are you looking for something like this:

# I used 'D' instead of 'T'
>>> df.set_index(df.index.floor('D')).groupby(level=0).count()
            category
2021-06-03         6
2021-06-04         2
2021-06-06         1
2021-06-08         1
2021-06-25         1
2021-06-29         6
2021-06-30         3

# OR

>>> df.set_index(df.index.floor('D')).groupby(level=0).sum()
            category
2021-06-03         2
2021-06-04         0
2021-06-06         1
2021-06-08         1
2021-06-25         0
2021-06-29         3
2021-06-30         1
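
If it is the per-minute mode you are after (to match the expected output above), note two things about the traceback. First, resampling at '1T' creates an empty bin for every gap in the data longer than a minute, and calling np.max on the resulting zero-length counts array is what raises "zero-size array to reduction operation maximum which has no identity". Second, np.argwhere returns an array rather than a single value, which is why pandas also complains "Function does not reduce" / "Must produce aggregated value". A minimal sketch that avoids both, assuming the column is named category (the question doesn't name it) and breaking ties by taking the first mode:

    # resample to 1-minute bins and keep the most frequent value per bin;
    # empty bins (gaps longer than a minute) yield None and are dropped
    out = (
        df1['category']
            .resample('1T')
            .agg(lambda s: s.mode().iloc[0] if len(s) else None)
            .dropna()
    )

The same guard would fix custome_mod itself: return something like vals[np.argmax(counts)] when the bin is non-empty (a single value, so the function reduces) and a missing-value marker otherwise.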
