Transform values outside a pre-defined range into missing and compute a new variable based on specific columns using str.contains and pandas
I have a dataset with missing cases (NAs) and "impossible" values, defined as any value other than 1, 2, 3, 4, or 5.
import numpy as np
import pandas as pd
from numpy import nan
df = pd.DataFrame.from_dict({'aut_a_p1_r1': {131: 52.0, 106: 4.0, 80: 4.0, 108: 3.0, 303: 5.0, 145: nan, 172: nan, 103: nan, 67: nan, 59: nan, 7: 2.0, 9: 3.0, 248: 5.0, 219: 3.0, 134: 5.0, 105: 3.0, 176: 3.0, 245: 1.0, 271: 4.0, 249: 4.0}, 'aut_a_p1_r2': {131: 4.0, 106: 5.0, 80: 5.0, 108: 4.0, 303: 5.0, 145: nan, 172: nan, 103: nan, 67: nan, 59: nan, 7: 4.0, 9: 5.0, 248: 4.0, 219: 4.0, 134: 3.0, 105: 4.0, 176: 4.0, 245: 3.0, 271: 5.0, 249: 4.0}, 'aut_a_p1_r3': {131: 5.0, 106: 5.0, 80: 5.0, 108: 4.0, 303: 5.0, 145: nan, 172: nan, 103: nan, 67: nan, 59: nan, 7: 5.0, 9: 5.0, 248: 5.0, 219: 5.0, 134: 5.0, 105: 5.0, 176: 5.0, 245: 4.0, 271: 5.0, 249: 4.0}, 'aut_a_p1_r4': {131: 3.0, 106: 2.0, 80: 2.0, 108: 3.0, 303: 1.0, 145: nan, 172: nan, 103: nan, 67: nan, 59: nan, 7: 4.0, 9: 4.0, 248: 3.0, 219: 3.0, 134: 2.0, 105: 2.0, 176: 3.0, 245: 2.0, 271: 2.0, 249: 2.0}})
My goal is to convert all variables to numeric and then create a new column with the sum of other specific variables, ignoring the missing cases. If the value of a cell is outside some pre-defined range, convert it to missing. If possible, the sum should not be computed where missing values (NaN) are present, rather than returning 0 as the result.
This is the code I'm trying:
#Convert to numeric
df.loc[:,df.columns.str.contains("aut_a_")] = df.loc[:,df.columns.str.contains("aut_a_")].apply(pd.to_numeric, errors='coerce')
# Convert values != 1,2,3,4,5 to missing
????
# Sum
df["aut_sum"] = df.loc[:,df.columns.str.contains("aut_a_")].sum(axis=1)
df["aut_sum"]
Please feel free to improve my code.
You want to use filter to search for a label in the index/columns:
target = df.filter(regex='^aut_a_')
df['aut_sum'] = target.dropna().where(target.isin(np.arange(1, 6)), 0).sum(axis=1)
Output:
     aut_a_p1_r1  aut_a_p1_r2  aut_a_p1_r3  aut_a_p1_r4  aut_sum
131         52.0          4.0          5.0          3.0     12.0
106          4.0          5.0          5.0          2.0     16.0
80           4.0          5.0          5.0          2.0     16.0
108          3.0          4.0          4.0          3.0     14.0
303          5.0          5.0          5.0          1.0     16.0
145          NaN          NaN          NaN          NaN      NaN
172          NaN          NaN          NaN          NaN      NaN
103          NaN          NaN          NaN          NaN      NaN
67           NaN          NaN          NaN          NaN      NaN
59           NaN          NaN          NaN          NaN      NaN
7            2.0          4.0          5.0          4.0     15.0
9            3.0          5.0          5.0          4.0     17.0
248          5.0          4.0          5.0          3.0     17.0
219          3.0          4.0          5.0          3.0     15.0
134          5.0          3.0          5.0          2.0     15.0
105          3.0          4.0          5.0          2.0     14.0
176          3.0          4.0          5.0          3.0     15.0
245          1.0          3.0          4.0          2.0     10.0
271          4.0          5.0          5.0          2.0     16.0
249          4.0          4.0          4.0          2.0     14.0
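A possible variation on the answer above, as a minimal sketch: instead of replacing invalid values with 0 and relying on dropna(), turn them into NaN and let sum decide when a row has nothing to add up. The three-row frame below is made-up illustration data that only reuses the question's column naming; min_count=1 is a standard parameter of DataFrame.sum.

```python
import numpy as np
import pandas as pd

# Hypothetical mini data set, same column naming as in the question
df = pd.DataFrame(
    {
        "aut_a_p1_r1": [52.0, 4.0, np.nan],
        "aut_a_p1_r2": [4.0, 5.0, np.nan],
        "aut_a_p1_r3": [5.0, 5.0, np.nan],
    },
    index=[131, 106, 145],
)

target = df.filter(like="aut_a_")
# Keep only the values 1..5; anything else (such as 52.0) becomes NaN
valid = target.where(target.isin([1, 2, 3, 4, 5]))
# min_count=1: a row with no valid value at all sums to NaN instead of 0
df["aut_sum"] = valid.sum(axis=1, min_count=1)
print(df)
```

With this approach the rows that are entirely missing come out as NaN, while a row with a single impossible value still gets the sum of its remaining valid values.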
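Try:
# Note: dropna() also drops rows where an out-of-range value was just set to NaN
# (e.g. row 131), so aut_sum is NaN there. applymap is DataFrame.map in pandas >= 2.1.
df["aut_sum"] = (df.applymap(lambda x: x if x in [1,2,3,4,5] else np.nan)
                 .filter(like="aut_a_").dropna().sum(axis=1))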
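Note that the snippet above leaves a row without a sum as soon as any of its values is missing or out of range. If that strict reading is what you want, a rough equivalent without dropna() is to pass skipna=False to sum. This is only a sketch on made-up data; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical mini data set, same column naming as in the question
df = pd.DataFrame(
    {
        "aut_a_p1_r1": [52.0, 4.0, np.nan],
        "aut_a_p1_r2": [4.0, 5.0, np.nan],
    },
    index=[131, 106, 145],
)

cols = df.filter(like="aut_a_")
# Out-of-range values become NaN
valid = cols.where(cols.isin([1, 2, 3, 4, 5]))
# skipna=False: any NaN in a row makes the whole row sum NaN
strict_sum = valid.sum(axis=1, skipna=False)
df["aut_sum"] = strict_sum
print(df)
```

Here only row 106 gets a sum; row 131 is NaN because 52.0 was invalidated, and row 145 is NaN because it was empty to begin with.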
Use mask:
df['aut_sum'] = df.dropna(how='all').filter(like='aut_a_').mask((df < 1) | (df > 5)).sum(axis=1)
print(df)
# Output
     aut_a_p1_r1  aut_a_p1_r2  aut_a_p1_r3  aut_a_p1_r4  aut_sum
131         52.0          4.0          5.0          3.0     12.0
106          4.0          5.0          5.0          2.0     16.0
80           4.0          5.0          5.0          2.0     16.0
108          3.0          4.0          4.0          3.0     14.0
303          5.0          5.0          5.0          1.0     16.0
145          NaN          NaN          NaN          NaN      NaN
172          NaN          NaN          NaN          NaN      NaN
103          NaN          NaN          NaN          NaN      NaN
67           NaN          NaN          NaN          NaN      NaN
59           NaN          NaN          NaN          NaN      NaN
7            2.0          4.0          5.0          4.0     15.0
9            3.0          5.0          5.0          4.0     17.0
248          5.0          4.0          5.0          3.0     17.0
219          3.0          4.0          5.0          3.0     15.0
134          5.0          3.0          5.0          2.0     15.0
105          3.0          4.0          5.0          2.0     14.0
176          3.0          4.0          5.0          3.0     15.0
245          1.0          3.0          4.0          2.0     10.0
271          4.0          5.0          5.0          2.0     16.0
249          4.0          4.0          4.0          2.0     14.0
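The mask answer above only computes the sum; the 52.0 stays in the original column. If the impossible values themselves should become missing in df, as the question asks, one way is to write the masked values back first. The sketch below reuses the question's str.contains column selection on made-up data (the "other" column and all values are hypothetical). One thing to keep in mind: a range check like this keeps non-integer values such as 2.5, whereas isin([1, 2, 3, 4, 5]) would reject them.

```python
import numpy as np
import pandas as pd

# Hypothetical mini data set; "other" stands for any unrelated column
df = pd.DataFrame(
    {
        "aut_a_p1_r1": [52.0, 2.5, 3.0],
        "aut_a_p1_r2": [4.0, 5.0, 0.0],
        "other": [10, 20, 30],
    }
)

# Boolean selector for the item columns, as in the question
is_item = df.columns.str.contains("aut_a_")
items = df.loc[:, is_item]

# Overwrite values below 1 or above 5 with NaN in the frame itself
df.loc[:, is_item] = items.mask((items < 1) | (items > 5))

# Sum the cleaned columns; rows without any valid value would give NaN
df["aut_sum"] = df.loc[:, is_item].sum(axis=1, min_count=1)
print(df)
```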