Calculating percentage of times column values meet varying conditions

Question

I have a DataFrame like the one below, only with about 25 columns and 3000 rows. I need a second DataFrame, df_B, that shows, for each column of df_A and each target value in df_B, the percentage of rows in that column that are >= the target.

For example, in df_A, column d02 is >= .04 three times out of five (the len of the column), so that should be reflected in df_B as 60%.

I know how to do the comparison and percentages separately, but I am lost on how to put everything together and populate the new DF.

df_A        

     d01    d02    d03   
0  0.028  0.021  0.028    
1  0.051  0.063  0.093    
2  0.084  0.084  0.084     
3  0.061  0.061  0.072   
4  0.015  0.015  0.015

Goal...

df_B

  target    d01   d02   d03 
    .02     p     p     p
    .04     p    .60    p
    .06     p     p     p
    .08     p     p     p
    .15     p     p     p
    .20     p     p     p
    .25     p     p     p
    .30     p     p     p

Answer 1

Method

Create a list of targets.
Create a dictionary that maps each column name to a list of percentages (as fractions between 0 and 1), one per target.
Loop over the targets and, for each target, loop over the columns to calculate the percentage and append it to that column's list.
Create a DataFrame with the dictionary and the list of targets.

Code

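import pandas as pd

df_A = pd.DataFrame(data = {
    "d01": [ 0.028, 0.051, 0.084, 0.061, 0.015],
    "d02": [ 0.021, 0.063, 0.084, 0.061, 0.015],
    "d03": [ 0.028, 0.093, 0.084, 0.072, 0.015] })

target = [.02, .04, .06, .08, .15, .20, .25, .30]

# One list of results per column of df_A
dic = {key: [] for key in df_A}

for t in target:
    for key in dic:
        # Count the values in this column that are >= the current target
        s = 0
        for val in df_A[key]:
            if val >= t:
                s += 1
        # Store the count as a fraction of the column length
        dic[key].append(s / len(df_A[key]))

# Rows are indexed by target, columns match df_A
df_B = pd.DataFrame(data = dic, index = target)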

Result on the example

df_B

      d01  d02  d03
0.02  0.8  0.8  0.8
0.04  0.6  0.6  0.6
0.06  0.4  0.6  0.6
0.08  0.2  0.2  0.4
0.15  0.0  0.0  0.0
0.20  0.0  0.0  0.0
0.25  0.0  0.0  0.0
0.30  0.0  0.0  0.0

Answer 2

Suppose you have (copied example data from Louis):

import pandas as pd

df_A = pd.DataFrame(data = {
    "d01": [ 0.028, 0.051, 0.084, 0.061, 0.015],
    "d02": [ 0.021, 0.063, 0.084, 0.061, 0.015],
    "d03": [ 0.028, 0.093, 0.084, 0.072, 0.015] })

target = [.02, .04, .06, .08, .15, .20, .25, .30]

df_B = pd.DataFrame(index=target, columns=df_A.columns).rename_axis('target',axis=0)

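You can then apply a lambda function to each row of df_B to calculate the percentage. Inside the lambda, x.name is the row's target, df_A.ge(x.name) marks the values that are >= that target, and sum().div(len(df_A)) turns the per-column counts into fractions.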

df_B.apply(lambda x: df_A.ge(x.name).sum().div(len(df_A)), axis=1).reset_index()

Out[249]: 
   target  d01  d02  d03
0    0.02  0.8  0.8  0.8
1    0.04  0.6  0.6  0.6
2    0.06  0.4  0.6  0.6
3    0.08  0.2  0.2  0.4
4    0.15  0.0  0.0  0.0
5    0.20  0.0  0.0  0.0
6    0.25  0.0  0.0  0.0
7    0.30  0.0  0.0  0.0
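
As a variation on the approach above, the empty df_B and the row-wise apply can be skipped by building one Series per target and stacking them. This is only a minimal sketch on the same example data; the names rows and df_B_alt are made up for illustration. It relies on the fact that the mean of a boolean column is the fraction of True values.

```python
import pandas as pd

df_A = pd.DataFrame({
    "d01": [0.028, 0.051, 0.084, 0.061, 0.015],
    "d02": [0.021, 0.063, 0.084, 0.061, 0.015],
    "d03": [0.028, 0.093, 0.084, 0.072, 0.015],
})
target = [.02, .04, .06, .08, .15, .20, .25, .30]

# df_A.ge(t) is a boolean frame; its column-wise mean is the fraction
# of rows in each column that are >= t.
rows = {t: df_A.ge(t).mean() for t in target}

# The dict of Series becomes a frame with one column per target;
# transposing puts targets on the rows, as in the desired df_B.
df_B_alt = pd.DataFrame(rows).T.rename_axis("target")
```

Calling df_B_alt.reset_index() should give the same layout as the output shown above, with target as a regular column.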

Answer 3

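One way is to use numpy:

import numpy as np
import pandas as pd

# a: values of df_A, t: target values from df_T, n: number of rows in df_A
a, t, n = df_A.values, df_T.values, len(df_A.index)
res = np.zeros((len(df_T.index), len(df_A.columns)))

for i in range(res.shape[0]):
    for j in range(res.shape[1]):
        # Fraction of values in column j that are >= target i
        res[i, j] = np.sum(a[:, j] >= t[i]) / n

# Attach the fractions to the target column
result = df_T.join(pd.DataFrame(res, columns=df_A.columns))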

Setup

df_A:

     d01    d02    d03
0  0.028  0.021  0.028
1  0.051  0.063  0.093
2  0.084  0.084  0.084
3  0.061  0.061  0.072
4  0.015  0.015  0.015

df_T:

   target
0    0.02
1    0.04
2    0.06
3    0.08
4    0.15
5    0.20
6    0.25
7    0.30

Result

   target  d01  d02  d03
0    0.02  0.8  0.8  0.8
1    0.04  0.6  0.6  0.6
2    0.06  0.4  0.6  0.6
3    0.08  0.2  0.2  0.4
4    0.15  0.0  0.0  0.0
5    0.20  0.0  0.0  0.0
6    0.25  0.0  0.0  0.0
7    0.30  0.0  0.0  0.0
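
The two Python loops in the numpy code can, in principle, be replaced by a single broadcast comparison. The following is a hedged sketch rather than part of the original answer; it assumes df_T holds a single column named target (as the result table suggests), and the names hits and result_vec are illustrative.

```python
import numpy as np
import pandas as pd

df_A = pd.DataFrame({
    "d01": [0.028, 0.051, 0.084, 0.061, 0.015],
    "d02": [0.021, 0.063, 0.084, 0.061, 0.015],
    "d03": [0.028, 0.093, 0.084, 0.072, 0.015],
})
# Assumption: df_T has one column called "target".
df_T = pd.DataFrame({"target": [.02, .04, .06, .08, .15, .20, .25, .30]})

a = df_A.to_numpy()             # shape (n_rows, n_cols)
t = df_T["target"].to_numpy()   # shape (n_targets,)

# a[None, :, :] has shape (1, n_rows, n_cols) and t[:, None, None] has
# shape (n_targets, 1, 1), so the comparison broadcasts to
# (n_targets, n_rows, n_cols).
hits = a[None, :, :] >= t[:, None, None]

# Averaging over the row axis gives the fraction of rows meeting each target.
res = hits.mean(axis=1)         # shape (n_targets, n_cols)

result_vec = df_T.join(pd.DataFrame(res, columns=df_A.columns))
```

The trade-off is memory: the intermediate boolean array holds n_targets * n_rows * n_cols elements, which is small for 8 targets, 3000 rows and 25 columns but grows quickly with larger inputs.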

Performance benchmarking

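The numpy version can be further optimised using numba. In the timings below, louis, allen and jpp are the functions for Answers 1, 2 and 3, defined at the end of this section.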

%timeit allen(df_A, target)  # 40ms
%timeit louis(df_A, target)  # 7.79ms
%timeit jpp(df_A, df_T)      # 4.29ms

df_A = pd.concat([df_A]*10)
df_T = pd.concat([df_T]*5)
target = [.02, .04, .06, .08, .15, .20, .25, .30] * 5

def allen(df_A, target):
    df_B = pd.DataFrame(index=target, columns=df_A.columns).rename_axis('target',axis=0)
    return df_B.apply(lambda x: df_A.ge(x.name).sum().div(len(df_A)), axis=1).reset_index()

def jpp(df_A, df_T):
    a, t, n = df_A.values, df_T.values, len(df_A.index)
    res = np.zeros((len(df_T.index), len(df_A.columns)))

    for i in range(res.shape[0]):
        for j in range(res.shape[1]):
            res[i, j] = np.sum(a[:, j] >= t[i]) / n

    return df_T.join(pd.DataFrame(res, columns=df_A.columns))

def louis(df_A, target):
    dic = {key: [] for key in df_A}

    for t in target:
        for key in dic:
            s = 0
            for val in df_A[key]:
                if val >= t:
                    s += 1
            dic[key].append(s / len(df_A[key]))

    return pd.DataFrame(data = dic, index = target)
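
The remark above about numba could be sketched roughly as follows. This is an illustration only: fraction_ge and jpp_numba are hypothetical names, df_T is assumed to have a single column named target, and no timing is claimed for it. The fallback decorator is there only so the sketch still runs, uncompiled, when numba is not installed.

```python
import numpy as np
import pandas as pd

try:
    from numba import njit
except ImportError:
    # Fallback: a no-op decorator, so the code runs (slowly) without numba.
    def njit(func):
        return func

@njit
def fraction_ge(a, t):
    # a: 2-D float array (rows x columns), t: 1-D float array of targets
    n_rows, n_cols = a.shape
    res = np.zeros((t.shape[0], n_cols))
    for i in range(t.shape[0]):
        for j in range(n_cols):
            count = 0
            for k in range(n_rows):
                if a[k, j] >= t[i]:
                    count += 1
            res[i, j] = count / n_rows
    return res

def jpp_numba(df_A, df_T):
    a = df_A.to_numpy(dtype=np.float64)
    t = df_T["target"].to_numpy(dtype=np.float64)
    res = fraction_ge(a, t)
    # reset_index keeps the join aligned by position even if df_T's index
    # contains duplicates (for example after pd.concat).
    return df_T.reset_index(drop=True).join(pd.DataFrame(res, columns=df_A.columns))

df_A = pd.DataFrame({
    "d01": [0.028, 0.051, 0.084, 0.061, 0.015],
    "d02": [0.021, 0.063, 0.084, 0.061, 0.015],
    "d03": [0.028, 0.093, 0.084, 0.072, 0.015],
})
df_T = pd.DataFrame({"target": [.02, .04, .06, .08, .15, .20, .25, .30]})

out = jpp_numba(df_A, df_T)
```

With numba available, the first call includes compilation time, so any benchmark should time a second call.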

solution1
1 2018-03-07 01:00:23

solution2
1 2018-03-07 01:12:31

solution3
1 ACCEPTED 2018-03-07 01:19:34
