I have a DataFrame like the one below, only with about 25 columns and 3000 rows. I need a second DataFrame that shows, for each column of df_A, the percentage of rows whose value is >= each target listed in df_B.
For example, in df_A, column d02 is >= .04 three times out of five (the len of the column), so that should be reflected in df_B as 60%.
I know how to do the comparison and percentages separately, but I am lost on how to put everything together and populate the new DF.
df_A
     d01    d02    d03
0  0.028  0.021  0.028
1  0.051  0.063  0.093
2  0.084  0.084  0.084
3  0.061  0.061  0.072
4  0.015  0.015  0.015
Goal...
df_B
target  d01  d02  d03
   .02    p    p    p
   .04    p  .60    p
   .06    p    p    p
   .08    p    p    p
   .15    p    p    p
   .20    p    p    p
   .25    p    p    p
   .30    p    p    p
You can build the DataFrame with a dictionary and the list of targets:
import pandas as pd

df_A = pd.DataFrame(data={
    "d01": [0.028, 0.051, 0.084, 0.061, 0.015],
    "d02": [0.021, 0.063, 0.084, 0.061, 0.015],
    "d03": [0.028, 0.093, 0.084, 0.072, 0.015]})
target = [.02, .04, .06, .08, .15, .20, .25, .30]

# One list of fractions per column, filled in target by target
dic = {key: [] for key in df_A}
for t in target:
    for key in dic:
        s = 0  # number of values in this column that are >= t
        for val in df_A[key]:
            if val >= t:
                s += 1
        dic[key].append(s / len(df_A[key]))
df_B = pd.DataFrame(data=dic, index=target)
df_B
      d01  d02  d03
0.02  0.8  0.8  0.8
0.04  0.6  0.6  0.6
0.06  0.4  0.6  0.6
0.08  0.2  0.2  0.4
0.15  0.0  0.0  0.0
0.20  0.0  0.0  0.0
0.25  0.0  0.0  0.0
0.30  0.0  0.0  0.0
Suppose you have (copied example data from Louis):
df_A = pd.DataFrame(data = {
"d01": [ 0.028, 0.051, 0.084, 0.061, 0.015],
"d02": [ 0.021, 0.063, 0.084, 0.061, 0.015],
"d03": [ 0.028, 0.093, 0.084, 0.072, 0.015] })
target = [.02, .04, .06, .08, .15, .20, .25, .30]
df_B = pd.DataFrame(index=target, columns=df_A.columns).rename_axis('target',axis=0)
You can use a lambda function to calculate the percentage.
df_B.apply(lambda x: df_A.ge(x.name).sum().div(len(df_A)), axis=1).reset_index()
Out[249]:
   target  d01  d02  d03
0    0.02  0.8  0.8  0.8
1    0.04  0.6  0.6  0.6
2    0.06  0.4  0.6  0.6
3    0.08  0.2  0.2  0.4
4    0.15  0.0  0.0  0.0
5    0.20  0.0  0.0  0.0
6    0.25  0.0  0.0  0.0
7    0.30  0.0  0.0  0.0
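To see what the lambda does for a single row of df_B, the chain can be unrolled for one target. This is only an illustrative sketch: the intermediate names `t`, `mask`, `counts` and `share` are introduced here and do not appear in the answer above.

```python
import pandas as pd

df_A = pd.DataFrame({
    "d01": [0.028, 0.051, 0.084, 0.061, 0.015],
    "d02": [0.021, 0.063, 0.084, 0.061, 0.015],
    "d03": [0.028, 0.093, 0.084, 0.072, 0.015],
})

t = 0.06                        # one target, i.e. x.name for one row of df_B
mask = df_A.ge(t)               # boolean frame: True where the value is >= t
counts = mask.sum()             # per column: number of True values
share = counts.div(len(df_A))   # per column: fraction of rows that are >= t

print(share.to_dict())
```

For t = 0.06 this gives 0.4 for d01 and 0.6 for d02 and d03, matching the 0.06 row of the output above. The apply call repeats this for every target in the index.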
One way is to use numpy:
import numpy as np
import pandas as pd

# df_A and df_T are defined in the Setup section
a, t, n = df_A.values, df_T.values, len(df_A.index)
res = np.zeros((len(df_T.index), len(df_A.columns)))
for i in range(res.shape[0]):
    for j in range(res.shape[1]):
        # fraction of rows in column j that are >= target i
        res[i, j] = np.sum(a[:, j] >= t[i]) / n
result = df_T.join(pd.DataFrame(res, columns=df_A.columns))
Setup
df_A:
     d01    d02    d03
0  0.028  0.021  0.028
1  0.051  0.063  0.093
2  0.084  0.084  0.084
3  0.061  0.061  0.072
4  0.015  0.015  0.015
df_T:
   target
0    0.02
1    0.04
2    0.06
3    0.08
4    0.15
5    0.20
6    0.25
7    0.30
Result
   target  d01  d02  d03
0    0.02  0.8  0.8  0.8
1    0.04  0.6  0.6  0.6
2    0.06  0.4  0.6  0.6
3    0.08  0.2  0.2  0.4
4    0.15  0.0  0.0  0.0
5    0.20  0.0  0.0  0.0
6    0.25  0.0  0.0  0.0
7    0.30  0.0  0.0  0.0
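The two explicit loops can probably also be replaced by numpy broadcasting. The following is a sketch of that alternative, not part of the answer's code; it assumes df_T has a single `target` column as in the setup, and `hits` is a name introduced here.

```python
import numpy as np
import pandas as pd

df_A = pd.DataFrame({
    "d01": [0.028, 0.051, 0.084, 0.061, 0.015],
    "d02": [0.021, 0.063, 0.084, 0.061, 0.015],
    "d03": [0.028, 0.093, 0.084, 0.072, 0.015],
})
df_T = pd.DataFrame({"target": [.02, .04, .06, .08, .15, .20, .25, .30]})

a = df_A.to_numpy()              # shape (rows, columns) = (5, 3)
t = df_T["target"].to_numpy()    # shape (targets,) = (8,)

# Compare every value with every target at once: shape (8, 5, 3)
hits = a[None, :, :] >= t[:, None, None]

# Averaging the booleans over the row axis gives the fraction per target/column
res = hits.mean(axis=1)          # shape (8, 3)

result = df_T.join(pd.DataFrame(res, columns=df_A.columns))
print(result)
```

The intermediate boolean array holds targets x rows x columns entries, so for roughly 3000 rows, 25 columns and 8 targets it stays small. With far more targets the looped version may be kinder to memory.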
Performance benchmarking
The numpy version can be further optimised using numba.
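As a rough idea of what a numba variant might look like, here is a sketch that assumes numba is installed and that the targets are passed as a 1-D float array. The function name `share_at_or_above` and the no-op fallback decorator are illustrative additions, not something from the answer, and no timing is claimed for it.

```python
import numpy as np
import pandas as pd

try:
    from numba import njit
except ImportError:
    # Assumption: if numba is missing, fall back to plain Python (no speed-up)
    def njit(func):
        return func


@njit
def share_at_or_above(a, t):
    """Fraction of rows in each column of `a` that are >= each value in `t`."""
    n_rows, n_cols = a.shape
    res = np.zeros((t.shape[0], n_cols))
    for i in range(t.shape[0]):
        for j in range(n_cols):
            count = 0
            for k in range(n_rows):
                if a[k, j] >= t[i]:
                    count += 1
            res[i, j] = count / n_rows
    return res


df_A = pd.DataFrame({
    "d01": [0.028, 0.051, 0.084, 0.061, 0.015],
    "d02": [0.021, 0.063, 0.084, 0.061, 0.015],
    "d03": [0.028, 0.093, 0.084, 0.072, 0.015],
})
df_T = pd.DataFrame({"target": [.02, .04, .06, .08, .15, .20, .25, .30]})

res = share_at_or_above(df_A.to_numpy(), df_T["target"].to_numpy())
result = df_T.join(pd.DataFrame(res, columns=df_A.columns))
print(result)
```

The first call includes compilation time, so any benchmark should time a second call.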
%timeit allen(df_A, target) # 40ms
%timeit louis(df_A, target) # 7.79ms
%timeit jpp(df_A, df_T) # 4.29ms
df_A = pd.concat([df_A]*10)
df_T = pd.concat([df_T]*5)
target = [.02, .04, .06, .08, .15, .20, .25, .30] * 5
def allen(df_A, target):
    df_B = pd.DataFrame(index=target, columns=df_A.columns).rename_axis('target',axis=0)
    return df_B.apply(lambda x: df_A.ge(x.name).sum().div(len(df_A)), axis=1).reset_index()

def jpp(df_A, df_T):
    a, t, n = df_A.values, df_T.values, len(df_A.index)
    res = np.zeros((len(df_T.index), len(df_A.columns)))
    for i in range(res.shape[0]):
        for j in range(res.shape[1]):
            res[i, j] = np.sum(a[:, j] >= t[i]) / n
    return df_T.join(pd.DataFrame(res, columns=df_A.columns))

def louis(df_A, target):
    dic = {key: [] for key in df_A}
    for t in target:
        for key in dic:
            s = 0
            for val in df_A[key]:
                if val >= t:
                    s += 1
            dic[key].append(s / len(df_A[key]))
    return pd.DataFrame(data = dic, index = target)
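%timeit is an IPython magic, so the timings above cannot be run as-is in a plain script. A minimal sketch using the standard-library timeit module is shown below. The helper `best_ms` is a hypothetical name introduced here, `louis` is copied from above, and the numbers it prints will differ by machine and library version.

```python
import timeit

import pandas as pd


def louis(df_A, target):
    dic = {key: [] for key in df_A}
    for t in target:
        for key in dic:
            s = 0
            for val in df_A[key]:
                if val >= t:
                    s += 1
            dic[key].append(s / len(df_A[key]))
    return pd.DataFrame(data=dic, index=target)


def best_ms(func, *args, number=5, repeat=3):
    """Best average time per call in milliseconds over `repeat` runs."""
    timings = timeit.repeat(lambda: func(*args), number=number, repeat=repeat)
    return min(timings) / number * 1000


# Same enlarged inputs as in the benchmark setup above
df_A = pd.concat([pd.DataFrame({
    "d01": [0.028, 0.051, 0.084, 0.061, 0.015],
    "d02": [0.021, 0.063, 0.084, 0.061, 0.015],
    "d03": [0.028, 0.093, 0.084, 0.072, 0.015],
})] * 10)
target = [.02, .04, .06, .08, .15, .20, .25, .30] * 5

elapsed = best_ms(louis, df_A, target)
print(f"louis: {elapsed:.2f} ms per call")
```

The same helper can be pointed at `allen` or `jpp` by passing the function and its arguments.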