
How to count occurrences of consecutive 1s by column and take mean by block

import pandas as pd

x = pd.DataFrame(index=pd.date_range(start="2017-1-1", end="2017-1-13"),
                 columns="a b c".split())
x.iloc[0:2, 0] = 1    # column "a"
x.iloc[5:10, 0] = 1
x.iloc[9:12, 1] = 1   # column "b"
x.iloc[1:3, 2] = 1    # column "c"
x.iloc[5, 2] = 1

            a   b   c
2017-01-01  1   NaN NaN
2017-01-02  1   NaN 1
2017-01-03  NaN NaN 1
2017-01-04  NaN NaN NaN
2017-01-05  NaN NaN NaN
2017-01-06  1   NaN 1
2017-01-07  1   NaN NaN
2017-01-08  1   NaN NaN
2017-01-09  1   NaN NaN
2017-01-10  1   1   NaN
2017-01-11  NaN 1   NaN
2017-01-12  NaN 1   NaN
2017-01-13  NaN NaN NaN

Given the above dataframe, x, I want to return, for each of the columns a, b, and c, the average number of 1s per run of consecutive 1s. The average for each column is taken over the number of blocks that contain consecutive 1s.

For example, column a will output the average of 2 and 5, which is 3.5. We divide by 2 because there are 2 consecutive 1s between Jan-01 and Jan-02, then 5 consecutive 1s between Jan-06 and Jan-10, i.e. 2 blocks of 1s in total. Similarly, for column b, we will have 3 because there is only one run of consecutive 1s, between Jan-10 and Jan-12. Finally, for column c, we will have the average of 2 and 1, which is 1.5. (A plain-Python reference sketch follows the expected output below.)

Expected output of the toy example:

a    b  c
3.5  3  1.5
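
For reference, here is a minimal plain-Python sketch of the same per-block averaging, written only to pin down the definition (the helper name block_mean is my own, not part of any answer below):

def block_mean(values):
    """Average length of the runs of consecutive 1s in a sequence."""
    runs, current = [], 0
    for v in values:
        if v == 1:
            current += 1          # still inside a run of 1s
        elif current:
            runs.append(current)  # a run just ended
            current = 0
    if current:
        runs.append(current)      # run that reaches the end of the sequence
    return sum(runs) / len(runs) if runs else float("nan")

print({c: block_mean(x[c]) for c in x})  # {'a': 3.5, 'b': 3.0, 'c': 1.5}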

Use mask + apply with value_counts, and finally find the mean of your counts -

x.eq(1)\
 .ne(x.eq(1).shift())\
 .cumsum(0)\
 .mask(x.ne(1))\
 .apply(pd.Series.value_counts)\
 .mean(0)

a    3.5
b    3.0
c    1.5
dtype: float64

Details

First, label every run of consecutive values in your dataframe -

i = x.eq(1).ne(x.eq(1).shift()).cumsum(0)
i

            a  b  c
2017-01-01  1  1  1
2017-01-02  1  1  2
2017-01-03  2  1  2
2017-01-04  2  1  3
2017-01-05  2  1  3
2017-01-06  3  1  4
2017-01-07  3  1  5
2017-01-08  3  1  5
2017-01-09  3  1  5
2017-01-10  3  2  5
2017-01-11  4  2  5
2017-01-12  4  2  5
2017-01-13  4  3  5

Now, keep only those group values whose cells were originally 1 in x -

j = i.mask(x.ne(1))
j

              a    b    c
2017-01-01  1.0  NaN  NaN
2017-01-02  1.0  NaN  2.0
2017-01-03  NaN  NaN  2.0
2017-01-04  NaN  NaN  NaN
2017-01-05  NaN  NaN  NaN
2017-01-06  3.0  NaN  4.0
2017-01-07  3.0  NaN  NaN
2017-01-08  3.0  NaN  NaN
2017-01-09  3.0  NaN  NaN
2017-01-10  3.0  2.0  NaN
2017-01-11  NaN  2.0  NaN
2017-01-12  NaN  2.0  NaN
2017-01-13  NaN  NaN  NaN

Now, apply value_counts across each column -

k = j.apply(pd.Series.value_counts)
k


       a    b    c
1.0  2.0  NaN  NaN
2.0  NaN  3.0  2.0
3.0  5.0  NaN  NaN
4.0  NaN  NaN  1.0

And just find the column-wise mean -

k.mean(0)

a    3.5
b    3.0
c    1.5
dtype: float64

As a handy note, if you want to, for example, find the mean only over runs of more than n consecutive 1s (say, n = 1 here), then you can filter on k's values (the run lengths) quite easily -

k[k > 1].mean(0)

a    3.5
b    3.0
c    2.0
dtype: float64
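
Wrapped up as a small helper for an arbitrary threshold n (my own sketch, reusing exactly the steps above; the function name is hypothetical):

def mean_long_runs(frame, n=1):
    # label runs, keep labels only where the cell is 1, count cells per label,
    # then average only the runs longer than n
    runs = frame.eq(1).ne(frame.eq(1).shift()).cumsum(0)
    counts = runs.mask(frame.ne(1)).apply(pd.Series.value_counts)
    return counts[counts > n].mean(0)

mean_long_runs(x, n=1)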

Let's try:

x.apply(lambda s: s.groupby(s.ne(1).cumsum()).sum(min_count=1).mean())

Output:

a    3.5
b    3.0
c    1.5
dtype: float64

Apply the lambda function to each column of the dataframe. The cumulative sum of s.ne(1) increments at every non-1 value and stays constant inside a run of 1s, so each run lands in its own group; sum() then gives the length of each run (min_count=1 keeps the all-NaN groups as NaN so they are skipped), and mean() averages those lengths.
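
To see the mechanism on one column (an illustration of my own; pd.to_numeric is only there to give the object-dtype toy column a numeric dtype):

s = pd.to_numeric(x["a"])
labels = s.ne(1).cumsum()            # constant within each run of 1s, bumps at every non-1 row
print(s.groupby(labels).sum(min_count=1))
# label 0 -> 2.0 (first run), label 3 -> 5.0 (second run); the all-NaN groups stay NaN,
# so .mean() over these sums gives (2 + 5) / 2 = 3.5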

This utilizes cumsum , shift , and an xor mask.

b = x.cumsum()  
c = b.shift(-1)
b_masked = b[b.isnull() ^ c.isnull()]

b_masked.max() / b_masked.count()

a    3.5
b    3.0
c    1.5
dtype: float64

First do b = x.cumsum()

    a       b       c
0   1.0     NaN     NaN
1   2.0     NaN     1.0
2   NaN     NaN     2.0
3   NaN     NaN     NaN
4   NaN     NaN     NaN
5   3.0     NaN     3.0
6   4.0     NaN     NaN
7   5.0     NaN     NaN
8   6.0     NaN     NaN
9   7.0     1.0     NaN
10  NaN     2.0     NaN
11  NaN     3.0     NaN
12  NaN     NaN     NaN

Then, shift b upward: c = b.shift(-1). Next, create an XOR mask with b.isnull() ^ c.isnull(). This mask keeps only one value per run of consecutive 1s: the last (largest) cumulative sum in the run. Note that the mask is also True one row before each run starts, but since b is NaN at that position, indexing back into b does not introduce any new values. A small example illustrates this:

 b   c   b.isnull() ^ c.isnull()    b[b.isnull() ^ c.isnull()]
NaN  1         True                          NaN
 1   2         False                         NaN
 2  NaN        True                          2
NaN NaN        False                         NaN

For the full dataframe, b[b.isnull() ^ c.isnull()] looks like

    a       b        c
0   NaN     NaN     NaN
1   2.0     NaN     NaN
2   NaN     NaN     2.0
3   NaN     NaN     NaN
4   NaN     NaN     NaN
5   NaN     NaN     3.0
6   NaN     NaN     NaN
7   NaN     NaN     NaN
8   NaN     NaN     NaN
9   7.0     NaN     NaN
10  NaN     NaN     NaN
11  NaN     3.0     NaN
12  NaN     NaN     NaN

Because we did cumsum in the first place, the maximum of each masked column is the total number of 1s in that column, and the number of non-NaN values is the number of runs; dividing the two gives the mean run length.

Thus, we do b[b.isnull() ^ c.isnull()].max() / b[b.isnull() ^ c.isnull()].count()
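
A quick sanity check on column a alone (my own illustration, reusing the b and c defined above):

m = b["a"][b["a"].isnull() ^ c["a"].isnull()]
print(m.dropna().tolist())   # the run-end cumulative sums (2 and 7 for the toy data)
print(m.max() / m.count())   # 7 ones spread over 2 runs -> 3.5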

You could use regex:

import re
import numpy as np

p = r'1+'

counts = {
    c: np.mean([len(m) for m in re.findall(p, ''.join(map(str, x[c].values)))])
    for c in ['a', 'b', 'c']
}

This method works because each column can be read as a string over the alphabet { 1 , nan }: the pattern 1+ matches every run of adjacent 1s, re.findall returns those runs as a list of strings, and the answer is the mean of their lengths.
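
For the toy frame above, this should come out as follows (the joined-string form in the comment is just what ''.join produces for column a):

# column "a" joins to something like "11nannannan11111nannannan",
# so re.findall finds ["11", "11111"]
print(counts)   # expected values: a -> 3.5, b -> 3.0, c -> 1.5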
