
How can I get the count of consecutive positive numbers in each column of a 2-dimensional DataFrame in Python/pandas?

      X     y
a   1.0  -1.0
b  -2.0   2.0
c   3.0  -3.0
d   2.1   4.0

Output: 
       X     y
  a   1.0  -1.0
  b  -2.0   2.0
  c   3.0  -3.0
  d   2.1   4.0
Count 2     1

In the first column, the count resets to 0 at row b because of the -2. The result needs to be a DataFrame with the count appended as the last row.

Let us use cumsum to define your function:

def yourfun(x):
    # Each negative bumps the cumsum, so every run of non-negative values gets its own group id
    return x[x.ge(0)].groupby(x.lt(0).cumsum()).size().iloc[-1]

df.loc['Count'] = df.apply(yourfun)
df
Out[62]: 
         X    y
a      1.0 -1.0
b     -2.0  2.0
c      3.0 -3.0
d      2.1  4.0
Count  2.0  1.0
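
To see why grouping on the cumulative sum of negatives works, it may help to print the intermediate steps for column X. This is only a sketch on the question's sample data; the variable names run_id, run_sizes and last_run are illustrative and not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame(
    {"X": [1.0, -2.0, 3.0, 2.1], "y": [-1.0, 2.0, -3.0, 4.0]},
    index=list("abcd"),
)

x = df["X"]
# Each negative value increments the counter, so the non-negative
# values that follow it all share one group id.
run_id = x.lt(0).cumsum()
# Keep only the non-negative rows, then count rows per group id.
run_sizes = x[x.ge(0)].groupby(run_id).size()
# Size of the last group that contains any non-negative value.
last_run = run_sizes.iloc[-1]

print(run_id.tolist())     # [0, 1, 1, 1]
print(run_sizes.tolist())  # [1, 2]
print(last_run)            # 2
```

Note that .iloc[-1] picks the last run that exists in the column, which is not necessarily a run ending at the final row: for the values 1, 2, -1 it would still return 2. It also treats 0 as positive because of ge(0).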

There is a pure NumPy way without groupby, which is likely to be very fast. It returns the length of the longest run of strictly positive values (excluding 0) in each column:

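import numpy as np

def countpos(x):
    # Pad with -1 on both ends, locate non-positive entries, and measure the widest gap between them
    return np.diff(np.where(np.hstack((-1, x, -1)) <= 0)[0]).max() - 1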

df.loc['Count'] = df.apply(countpos)

Result:

>>> df
         X    y
a      1.0 -1.0
b     -2.0  2.0
c      3.0 -3.0
d      2.1  4.0
Count  2.0  1.0

Explanation

The np.where() looks for the indices of all non-positive values. For example:

>>> np.where(np.array([0,0,1,1,0,0,1]) <= 0)
(array([0, 1, 4, 5]),)

We bracket the actual values with -1 on both sides to force np.where() to report those boundary indices too. Then we take the diff of the indices, take its max, and subtract 1: voilà, the maximum length of runs of strictly positive numbers.
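
The same steps can be traced one at a time on column X of the question's data. This is a minimal sketch; the intermediate names padded, stops, gaps and longest are made up for illustration.

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0, 2.1])

# Sentinels on both ends so leading and trailing runs are bounded.
padded = np.hstack((-1, x, -1))   # [-1, 1, -2, 3, 2.1, -1]
# Positions of every non-positive value, sentinels included.
stops = np.where(padded <= 0)[0]
# Distance between neighbouring non-positive positions.
gaps = np.diff(stops)
# A gap of width g holds g - 1 strictly positive values.
longest = gaps.max() - 1

print(stops.tolist())  # [0, 2, 5]
print(gaps.tolist())   # [2, 3]
print(longest)         # 2
```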

Speed

df = pd.DataFrame(np.random.uniform(-1, 1, size=(10000,100)))

a = %timeit -o df.apply(countpos)
14.3 ms ± 11.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

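Here is another way, as a one-liner. Note that it assigns run ids to runs of either sign, so it returns the longest run of same-signed values in each column; for the sample data this happens to match the expected output:

df.loc['Count'] = df.lt(0).diff().ne(0).cumsum().stack().groupby(level=1).value_counts().groupby(level=0).max()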
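
If only runs of strictly positive values should count, one possible pandas-only variant is to sum a boolean mask within each run id. This is a sketch of a different technique, not the one-liner above; the helper name longest_positive_run and the extra column z are hypothetical and added only for illustration.

```python
import pandas as pd

def longest_positive_run(s):
    """Length of the longest run of strictly positive values in a Series."""
    pos = s.gt(0)
    # Every non-positive value starts a new group id.
    run_id = (~pos).cumsum()
    # Summing the boolean mask counts the positives inside each group.
    return int(pos.groupby(run_id).sum().max())

df = pd.DataFrame(
    {
        "X": [1.0, -2.0, 3.0, 2.1],
        "y": [-1.0, 2.0, -3.0, 4.0],
        "z": [-1.0, -2.0, -3.0, 1.0],  # hypothetical column: long negative run
    },
    index=list("abcd"),
)

counts = df.apply(longest_positive_run)
print(counts.to_dict())  # {'X': 2, 'y': 1, 'z': 1}
```

Column z has three negatives in a row, yet the result is 1, because only the positive entries of the mask are summed. A column with no positive values would give 0 rather than raising an error.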
