
Calculating percentage of times column values meet varying conditions

I have a DataFrame like the one below, only with about 25 columns and 3000 rows. I need a second DataFrame, df_B, that shows, for each column of df_A and each target in df_B, the fraction of rows whose value is >= that target.

For example, in df_A, column d02 is >= .04 three times out of five (the length of the column), so that should be reflected in df_B as 60%.

I know how to do the comparison and percentages separately, but I am lost on how to put everything together and populate the new DF.

df_A        

     d01    d02    d03   
0  0.028  0.021  0.028    
1  0.051  0.063  0.093    
2  0.084  0.084  0.084     
3  0.061  0.061  0.072   
4  0.015  0.015  0.015

Goal...

df_B

  target    d01   d02   d03 
    .02     p     p     p
    .04     p    .60    p
    .06     p     p     p
    .08     p     p     p
    .15     p     p     p
    .20     p     p     p
    .25     p     p     p
    .30     p     p     p

Method

  • Create a list of targets.
  • Create a dictionary that maps each column name to a list of percentages, one per target.
  • Loop over the targets and, for each target, loop over the columns to calculate the percentage and append it to that column's list.
  • Create a DataFrame from the dictionary, using the list of targets as the index.

Code

import pandas as pd

df_A = pd.DataFrame(data = {
    "d01": [ 0.028, 0.051, 0.084, 0.061, 0.015],
    "d02": [ 0.021, 0.063, 0.084, 0.061, 0.015],
    "d03": [ 0.028, 0.093, 0.084, 0.072, 0.015] })

target = [.02, .04, .06, .08, .15, .20, .25, .30]

dic = {key: [] for key in df_A}

for t in target:
    for key in dic:
        s = 0
        for val in df_A[key]:
            if val >= t:
                s += 1
        dic[key].append(s / len(df_A[key]))

df_B = pd.DataFrame(data = dic, index = target)

Result on the example

df_B

      d01  d02  d03
0.02  0.8  0.8  0.8
0.04  0.6  0.6  0.6
0.06  0.4  0.6  0.6
0.08  0.2  0.2  0.4
0.15  0.0  0.0  0.0
0.20  0.0  0.0  0.0
0.25  0.0  0.0  0.0
0.30  0.0  0.0  0.0
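
The two inner loops above can probably be collapsed, since the mean of a boolean column is the share of True values. A minimal sketch, assuming the same df_A and target as above (the name rows is only illustrative):

```python
import pandas as pd

df_A = pd.DataFrame({
    "d01": [0.028, 0.051, 0.084, 0.061, 0.015],
    "d02": [0.021, 0.063, 0.084, 0.061, 0.015],
    "d03": [0.028, 0.093, 0.084, 0.072, 0.015],
})
target = [.02, .04, .06, .08, .15, .20, .25, .30]

# (df_A >= t) is a boolean DataFrame; its column-wise mean is the
# fraction of rows that meet the target t.
rows = {t: (df_A >= t).mean() for t in target}

# Each dictionary value is a Series indexed by column name, so the
# DataFrame built from it has one column per target; transpose it to
# get one row per target.
df_B = pd.DataFrame(rows).T.rename_axis("target")
print(df_B)
```

Because the targets are used as dictionary keys here, this sketch assumes they are unique.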

Suppose you have (copied example data from Louis):

import pandas as pd

df_A = pd.DataFrame(data = {
    "d01": [ 0.028, 0.051, 0.084, 0.061, 0.015],
    "d02": [ 0.021, 0.063, 0.084, 0.061, 0.015],
    "d03": [ 0.028, 0.093, 0.084, 0.072, 0.015] })

target = [.02, .04, .06, .08, .15, .20, .25, .30]

df_B = pd.DataFrame(index=target, columns=df_A.columns).rename_axis('target',axis=0)

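You can then apply a lambda function to each row of df_B to calculate the percentage. Inside the lambda, x.name is the row label (the target), and df_A.ge(x.name) compares every value in df_A against it.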

df_B.apply(lambda x: df_A.ge(x.name).sum().div(len(df_A)), axis=1).reset_index()

Out[249]: 
   target  d01  d02  d03
0    0.02  0.8  0.8  0.8
1    0.04  0.6  0.6  0.6
2    0.06  0.4  0.6  0.6
3    0.08  0.2  0.2  0.4
4    0.15  0.0  0.0  0.0
5    0.20  0.0  0.0  0.0
6    0.25  0.0  0.0  0.0
7    0.30  0.0  0.0  0.0
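
To see what the lambda does for a single row, it may help to run its steps by hand for one target. A minimal sketch, assuming the same df_A and using .04 as the target (the intermediate names mask, counts and share are only illustrative):

```python
import pandas as pd

df_A = pd.DataFrame({
    "d01": [0.028, 0.051, 0.084, 0.061, 0.015],
    "d02": [0.021, 0.063, 0.084, 0.061, 0.015],
    "d03": [0.028, 0.093, 0.084, 0.072, 0.015],
})

t = 0.04  # plays the role of x.name in the lambda

mask = df_A.ge(t)               # element-wise df_A >= t, a boolean DataFrame
counts = mask.sum()             # number of True values in each column
share = counts.div(len(df_A))   # divide by the number of rows
print(share)
```

For d02 this gives three matches out of five rows, which is the 60% from the question.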

One way is to use numpy (df_A and df_T are defined under Setup below):

import numpy as np
import pandas as pd

a, t, n = df_A.values, df_T.values, len(df_A.index)
res = np.zeros((len(df_T.index), len(df_A.columns)))

for i in range(res.shape[0]):
    for j in range(res.shape[1]):
        res[i, j] = np.sum(a[:, j] >= t[i]) / n

result = df_T.join(pd.DataFrame(res, columns=df_A.columns))

Setup

df_A:

     d01    d02    d03
0  0.028  0.021  0.028
1  0.051  0.063  0.093
2  0.084  0.084  0.084
3  0.061  0.061  0.072
4  0.015  0.015  0.015

df_T:

   target
0    0.02
1    0.04
2    0.06
3    0.08
4    0.15
5    0.20
6    0.25
7    0.30

Result

   target  d01  d02  d03
0    0.02  0.8  0.8  0.8
1    0.04  0.6  0.6  0.6
2    0.06  0.4  0.6  0.6
3    0.08  0.2  0.2  0.4
4    0.15  0.0  0.0  0.0
5    0.20  0.0  0.0  0.0
6    0.25  0.0  0.0  0.0
7    0.30  0.0  0.0  0.0
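
The explicit double loop could also be replaced by numpy broadcasting. This is a sketch of an alternative, not part of the original answer and not benchmarked here; it assumes the same df_A and df_T as in Setup and builds an intermediate boolean array with one entry per (target, row, column) combination:

```python
import numpy as np
import pandas as pd

df_A = pd.DataFrame({
    "d01": [0.028, 0.051, 0.084, 0.061, 0.015],
    "d02": [0.021, 0.063, 0.084, 0.061, 0.015],
    "d03": [0.028, 0.093, 0.084, 0.072, 0.015],
})
df_T = pd.DataFrame({"target": [.02, .04, .06, .08, .15, .20, .25, .30]})

a = df_A.to_numpy()             # shape (n_rows, n_cols)
t = df_T["target"].to_numpy()   # shape (n_targets,)

# Compare every value with every target at once.
# mask has shape (n_targets, n_rows, n_cols).
mask = a[None, :, :] >= t[:, None, None]

# Averaging over the row axis gives the fraction of rows per target and column.
res = mask.mean(axis=1)         # shape (n_targets, n_cols)

result = df_T.join(pd.DataFrame(res, columns=df_A.columns, index=df_T.index))
print(result)
```

For 8 targets, 3000 rows and 25 columns the intermediate array holds 600,000 booleans, so memory should not be a concern at that size.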

Performance benchmarking

The numpy version can be further optimised using numba.

%timeit allen(df_A, target)  # 40ms
%timeit louis(df_A, target)  # 7.79ms
%timeit jpp(df_A, df_T)      # 4.29ms

df_A = pd.concat([df_A]*10)
df_T = pd.concat([df_T]*5)
target = [.02, .04, .06, .08, .15, .20, .25, .30] * 5

def allen(df_A, target):
    df_B = pd.DataFrame(index=target, columns=df_A.columns).rename_axis('target',axis=0)
    return df_B.apply(lambda x: df_A.ge(x.name).sum().div(len(df_A)), axis=1).reset_index()

def jpp(df_A, df_T):
    a, t, n = df_A.values, df_T.values, len(df_A.index)
    res = np.zeros((len(df_T.index), len(df_A.columns)))

    for i in range(res.shape[0]):
        for j in range(res.shape[1]):
            res[i, j] = np.sum(a[:, j] >= t[i]) / n

    return df_T.join(pd.DataFrame(res, columns=df_A.columns))

def louis(df_A, target):
    dic = {key: [] for key in df_A}

    for t in target:
        for key in dic:
            s = 0
            for val in df_A[key]:
                if val >= t:
                    s += 1
            dic[key].append(s / len(df_A[key]))

    return pd.DataFrame(data = dic, index = target)
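
The numba optimisation mentioned above is not shown in the answer. One way it might look is sketched below: the loops from jpp are moved into a function that works on plain numpy arrays and is compiled with numba's njit decorator. The function name share_ge is hypothetical, no timings are claimed for it, and the try/except fallback is only there so the sketch still runs (uncompiled) when numba is not installed:

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption: fall back to plain Python if numba is unavailable.
    def njit(func):
        return func


@njit
def share_ge(a, t):
    # res[i, j] is the fraction of values in column j of a that are >= t[i].
    n_rows, n_cols = a.shape
    res = np.zeros((t.shape[0], n_cols))
    for i in range(t.shape[0]):
        for j in range(n_cols):
            count = 0
            for k in range(n_rows):
                if a[k, j] >= t[i]:
                    count += 1
            res[i, j] = count / n_rows
    return res


# Same values as df_A and the target list, as plain arrays.
a = np.array([
    [0.028, 0.021, 0.028],
    [0.051, 0.063, 0.093],
    [0.084, 0.084, 0.084],
    [0.061, 0.061, 0.072],
    [0.015, 0.015, 0.015],
])
t = np.array([.02, .04, .06, .08, .15, .20, .25, .30])

res = share_ge(a, t)
print(res)
```

With DataFrames, the inputs would be df_A.to_numpy() and df_T["target"].to_numpy(), and the result can be wrapped in a DataFrame as in jpp.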
