import pandas as pd

x = pd.DataFrame(index=pd.date_range(start="2017-1-1", end="2017-1-13"),
                 columns=list("abc"))
# .ix is deprecated; use purely positional .iloc instead
x.iloc[0:2, 0] = 1   # column "a"
x.iloc[5:10, 0] = 1
x.iloc[9:12, 1] = 1  # column "b"
x.iloc[1:3, 2] = 1   # column "c"
x.iloc[5, 2] = 1
a b c
2017-01-01 1 NaN NaN
2017-01-02 1 NaN 1
2017-01-03 NaN NaN 1
2017-01-04 NaN NaN NaN
2017-01-05 NaN NaN NaN
2017-01-06 1 NaN 1
2017-01-07 1 NaN NaN
2017-01-08 1 NaN NaN
2017-01-09 1 NaN NaN
2017-01-10 1 1 NaN
2017-01-11 NaN 1 NaN
2017-01-12 NaN 1 NaN
2017-01-13 NaN NaN NaN
Given the above dataframe x, I want to return the average number of occurrences of 1s within each group of a, b, and c. The average for each column is taken over the number of blocks of consecutive 1s.
For example, column a will output the average of 2 and 5, which is 3.5. We divide by 2 because there are 2 consecutive 1s between Jan-01 and Jan-02, then 5 consecutive 1s between Jan-06 and Jan-10, so 2 blocks of 1s in total. Similarly, column b will give 3, because its only run of consecutive 1s, between Jan-10 and Jan-12, has length 3. Finally, column c will give the average of 2 and 1, which is 1.5.
Expected output of the toy example:
a b c
3.5 3 1.5
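For reference, the expected numbers can be reproduced with a naive pure-Python pass over each column (a sketch for verification only; the helper name mean_run_length is made up here):

```python
import pandas as pd

x = pd.DataFrame(index=pd.date_range(start="2017-1-1", end="2017-1-13"),
                 columns=list("abc"))
x.iloc[0:2, 0] = 1
x.iloc[5:10, 0] = 1
x.iloc[9:12, 1] = 1
x.iloc[1:3, 2] = 1
x.iloc[5, 2] = 1

def mean_run_length(col):
    """Average length of the runs of consecutive 1s in a column."""
    runs, current = [], 0
    for v in col:
        if v == 1:            # NaN == 1 is False, so NaN breaks a run
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:               # a run that reaches the end of the column
        runs.append(current)
    return sum(runs) / len(runs) if runs else float("nan")

result = {c: mean_run_length(x[c]) for c in x.columns}
# result: {'a': 3.5, 'b': 3.0, 'c': 1.5}
```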
Use mask + apply with value_counts, and finally, take the mean of your counts -
x.eq(1)\
.ne(x.eq(1).shift())\
.cumsum(0)\
.mask(x.ne(1))\
.apply(pd.Series.value_counts)\
.mean(0)
a 3.5
b 3.0
c 1.5
dtype: float64
Details
First, label each run of consecutive values in your dataframe -
i = x.eq(1).ne(x.eq(1).shift()).cumsum(0)
i
a b c
2017-01-01 1 1 1
2017-01-02 1 1 2
2017-01-03 2 1 2
2017-01-04 2 1 3
2017-01-05 2 1 3
2017-01-06 3 1 4
2017-01-07 3 1 5
2017-01-08 3 1 5
2017-01-09 3 1 5
2017-01-10 3 2 5
2017-01-11 4 2 5
2017-01-12 4 2 5
2017-01-13 4 3 5
Now, keep only those group labels whose cells were originally 1 in x -
j = i.mask(x.ne(1))
j
a b c
2017-01-01 1.0 NaN NaN
2017-01-02 1.0 NaN 2.0
2017-01-03 NaN NaN 2.0
2017-01-04 NaN NaN NaN
2017-01-05 NaN NaN NaN
2017-01-06 3.0 NaN 4.0
2017-01-07 3.0 NaN NaN
2017-01-08 3.0 NaN NaN
2017-01-09 3.0 NaN NaN
2017-01-10 3.0 2.0 NaN
2017-01-11 NaN 2.0 NaN
2017-01-12 NaN 2.0 NaN
2017-01-13 NaN NaN NaN
Now, apply value_counts across each column -
k = j.apply(pd.Series.value_counts)
k
a b c
1.0 2.0 NaN NaN
2.0 NaN 3.0 2.0
3.0 5.0 NaN NaN
4.0 NaN NaN 1.0
And just find the column-wise mean -
k.mean(0)
a 3.5
b 3.0
c 1.5
dtype: float64
As a handy note, if you want to, for example, find the mean over runs of more than n consecutive 1s only (say, n = 1 here), you can filter on k's values (which are the run lengths; k's index holds the run labels, not the lengths) quite easily -
k[k.gt(1)].mean(0)

a 3.5
b 3.0
c 2.0
dtype: float64
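As a quick cross-check of this pipeline, the same run lengths can be computed in plain Python with itertools.groupby (a sketch on a small made-up series, not part of the original answer):

```python
from itertools import groupby

import numpy as np
import pandas as pd

s = pd.Series([1, 1, np.nan, np.nan, 1, 1, 1])

# group consecutive equal flags of s.eq(1); keep only the runs where the flag is True
run_lengths = [sum(1 for _ in grp) for flag, grp in groupby(s.eq(1)) if flag]
mean_len = sum(run_lengths) / len(run_lengths)
# run_lengths: [2, 3], mean_len: 2.5
```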
Let's try:
x.apply(lambda s: s.groupby(s.ne(1).cumsum()).sum(min_count=1).mean())
Output:
a 3.5
b 3.0
c 1.5
dtype: float64
Apply the lambda function to each column of the dataframe. The lambda groups each column by s.ne(1).cumsum(): every non-1 value increments the counter, so each run of consecutive 1s shares a single group label. sum(min_count=1) then totals the 1s in each group while leaving all-NaN groups as NaN (a plain sum() would return 0 for them in pandas >= 0.22 and skew the result), and mean() averages the run totals, skipping the NaNs.
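To see why this grouping works, here are the intermediate values for a small hand-made column (the names s, keys, and sums are just illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 1, np.nan, np.nan, 1, 1, 1])

# every non-1 value bumps the counter, so each run of 1s shares one label
keys = s.ne(1).cumsum()
# keys: 0, 0, 1, 2, 2, 2, 2 -> the two runs of 1s get labels 0 and 2

# min_count=1 leaves the all-NaN group (label 1) as NaN instead of 0
sums = s.groupby(keys).sum(min_count=1)

mean_len = sums.mean()   # NaN group is skipped: (2 + 3) / 2 = 2.5
```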
This utilizes cumsum, shift, and an XOR mask.
b = x.cumsum()
c = b.shift(-1)
b_masked = b[b.isnull() ^ c.isnull()]
b_masked.max() / b_masked.count()
a 3.5
b 3.0
c 1.5
dtype: float64
First do b = x.cumsum()
a b c
0 1.0 NaN NaN
1 2.0 NaN 1.0
2 NaN NaN 2.0
3 NaN NaN NaN
4 NaN NaN NaN
5 3.0 NaN 3.0
6 4.0 NaN NaN
7 5.0 NaN NaN
8 6.0 NaN NaN
9 7.0 1.0 NaN
10 NaN 2.0 NaN
11 NaN 3.0 NaN
12 NaN NaN NaN
Then, shift b upward: c = b.shift(-1). Next, build an XOR mask with b.isnull() ^ c.isnull(). This mask keeps exactly one value per run of consecutive 1s (the cumulative sum at the end of the run). Note that it can also produce an extra True at the start of a run, but since the mask is applied back to b, where that position is NaN, no new elements are introduced. A small example illustrates this:
b c b.isnull() ^ c.isnull() b[b.isnull() ^ c.isnull()]
NaN 1 True NaN
1 2 False NaN
2 NaN True 2
NaN NaN False NaN
The full b[b.isnull() ^ c.isnull()] for our dataframe looks like
a b c
0 NaN NaN NaN
1 2.0 NaN NaN
2 NaN NaN 2.0
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN 3.0
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 7.0 NaN NaN
10 NaN NaN NaN
11 NaN 3.0 NaN
12 NaN NaN NaN
Because we did cumsum in the first place, the masked column keeps the running total at the end of each run, so its maximum is the total number of 1s and its count is the number of runs; their ratio is the mean run length. Thus, we do b[b.isnull() ^ c.isnull()].max() / b[b.isnull() ^ c.isnull()].count()
You could use a regex:
import re
import numpy as np

p = r'1+'
counts = {
    c: np.mean([len(m) for m in re.findall(p, ''.join(map(str, x[c].values)))])
    for c in ['a', 'b', 'c']
}
This method works because each column can be thought of as a string over the alphabet {1, nan}. The pattern 1+ matches every group of adjacent 1s, re.findall returns those groups as a list of strings, and we then take the mean of their lengths.
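One caveat: str() on the raw values only yields clean '1' characters when the 1s are stored as ints; with a float column, str(1.0) produces '1.0' and the pattern would split every run. A more robust sketch (the helper name mean_run_length is made up here) builds the string from an eq(1) mask instead:

```python
import re

import numpy as np
import pandas as pd

def mean_run_length(col):
    # encode each cell as '1' (equals 1) or '0' (anything else, incl. NaN)
    encoded = ''.join(np.where(col.eq(1), '1', '0'))
    runs = re.findall(r'1+', encoded)
    return np.mean([len(r) for r in runs]) if runs else np.nan

s = pd.Series([1.0, 1.0, np.nan, 1.0, 1.0, 1.0])  # float dtype on purpose
# mean_run_length(s) -> runs of length 2 and 3 -> 2.5
```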