简体   繁体   English

替换数据集上的值并在 pandas 上逐行应用四分位规则

[英]Replace values on dataset and apply quartile rule by row on pandas

I have a dataset with lots of variables.我有一个包含很多变量的数据集。 So I've extracted the numeric ones:所以我提取了数字:

numeric_columns = transposed_df.select_dtypes(np.number)

Then I want to replace all 0 values for 0.0001然后我想将所有 0 值替换为 0.0001

transposed_df[numeric_columns.columns] = numeric_columns.where(numeric_columns.eq(0, axis=0), 0.0001)

And here is the first problem.这是第一个问题。 This line is not replacing the 0 values with 0.0001, but is replacing all non zero values with 0.0001.此行不是用 0.0001 替换 0 值,而是用 0.0001 替换所有非零值

Also after this (replacing the 0 values by 0.0001) I want to replace all values there are less than the 1th quartile of the row to -1 and leave the others as they were.同样在此之后(将 0 值替换为 0.0001)我想将所有小于该行的第 1 个四分位数的值替换为 -1 并将其他值保持原样。 But I am not managing how.但我不管理如何。

To answer your first question回答你的第一个问题

In [36]: from pprint import pprint

In [37]: pprint( numeric_columns.where.__doc__)
('\n'
 'Replace values where the condition is False.\n'
 '\n'
 'Parameters\n'
 '----------\n'

because of that your all the values except 0 are getting replaced因为你除了 0 之外的所有值都被替换了

Use DataFrame.mask and for second condition compare by DataFrame.quantile :使用DataFrame.mask和第二个条件比较DataFrame.quantile

transposed_df = pd.DataFrame({
        'A':list('abcdef'),
         'B':[0,0.5,4,5,5,4],
         'C':[7,8,9,4,2,3],
         'D':[1,3,0,7,1,0],
         'E':[5,3,6,9,2,4],
         'F':list('aaabbb')
})
numeric_columns = transposed_df.select_dtypes(np.number)
m1 = numeric_columns.eq(0)
m2 = numeric_columns.lt(numeric_columns.quantile(q=0.25, axis=1), axis=0)
transposed_df[numeric_columns.columns] = numeric_columns.mask(m1, 0.0001).mask(m2, -1)

print (transposed_df)
   A    B  C    D  E  F
0  a -1.0  7  1.0  5  a
1  b -1.0  8  3.0  3  a
2  c  4.0  9 -1.0  6  a
3  d  5.0 -1  7.0  9  b
4  e  5.0  2 -1.0  2  b
5  f  4.0  3 -1.0  4  b

EDIT:编辑:

from  scipy.stats import zscore
print (transposed_df[numeric_columns.columns].apply(zscore))
          B         C         D         E
0 -2.236068  0.570352 -0.408248  0.073521
1  0.447214  0.950586  0.408248 -0.808736
2  0.447214  1.330821 -0.816497  0.514650
3  0.447214 -0.570352  2.041241  1.838037
4  0.447214 -1.330821 -0.408248 -1.249865
5  0.447214 -0.950586 -0.816497 -0.367607

EDIT1:编辑1:

transposed_df = pd.DataFrame({
        'A':list('abcdef'),
         'B':[0,1,1,1,1,1],
         'C':[1,8,9,4,2,3],
         'D':[1,3,0,7,1,0],
         'E':[1,3,6,9,2,4],
         'F':list('aaabbb')
})
numeric_columns = transposed_df.select_dtypes(np.number)

from  scipy.stats import zscore

df1 = pd.DataFrame(numeric_columns.apply(zscore, axis=1).tolist(),index=transposed_df.index)
transposed_df[numeric_columns.columns] = df1
print (transposed_df)
   A         B         C         D         E  F
0  a -1.732051  0.577350  0.577350  0.577350  a
1  b -1.063410  1.643452 -0.290021 -0.290021  a
2  c -0.816497  1.360828 -1.088662  0.544331  a
3  d -1.402136 -0.412393  0.577350  1.237179  b
4  e -1.000000  1.000000 -1.000000  1.000000  b
5  f -0.632456  0.632456 -1.264911  1.264911  b

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM