Using str.contains and pandas to convert values outside a predefined range to missing and compute a new variable from specific columns

Question

I have a dataset with missing cases (NAs) and "impossible" values, defined as any value other than 1, 2, 3, 4, or 5.

import numpy as np
import pandas as pd
from numpy import nan

df = pd.DataFrame.from_dict({'aut_a_p1_r1': {131: 52.0, 106: 4.0, 80: 4.0, 108: 3.0, 303: 5.0, 145: nan, 172: nan, 103: nan, 67: nan, 59: nan, 7: 2.0, 9: 3.0, 248: 5.0, 219: 3.0, 134: 5.0, 105: 3.0, 176: 3.0, 245: 1.0, 271: 4.0, 249: 4.0}, 'aut_a_p1_r2': {131: 4.0, 106: 5.0, 80: 5.0, 108: 4.0, 303: 5.0, 145: nan, 172: nan, 103: nan, 67: nan, 59: nan, 7: 4.0, 9: 5.0, 248: 4.0, 219: 4.0, 134: 3.0, 105: 4.0, 176: 4.0, 245: 3.0, 271: 5.0, 249: 4.0}, 'aut_a_p1_r3': {131: 5.0, 106: 5.0, 80: 5.0, 108: 4.0, 303: 5.0, 145: nan, 172: nan, 103: nan, 67: nan, 59: nan, 7: 5.0, 9: 5.0, 248: 5.0, 219: 5.0, 134: 5.0, 105: 5.0, 176: 5.0, 245: 4.0, 271: 5.0, 249: 4.0}, 'aut_a_p1_r4': {131: 3.0, 106: 2.0, 80: 2.0, 108: 3.0, 303: 1.0, 145: nan, 172: nan, 103: nan, 67: nan, 59: nan, 7: 4.0, 9: 4.0, 248: 3.0, 219: 3.0, 134: 2.0, 105: 2.0, 176: 3.0, 245: 2.0, 271: 2.0, 249: 2.0}})

My goal is to convert all variables to numeric and then create a new column holding the sum of specific other variables, ignoring the missing cases. If a cell's value is outside a predefined range, it should be converted to missing. If possible, rows in which only missing values (NaN) are present should not be summed to 0; the result should stay missing instead.

This is the code I'm trying:

#Convert to numeric
df.loc[:,df.columns.str.contains("aut_a_")] = df.loc[:,df.columns.str.contains("aut_a_")].apply(pd.to_numeric, errors='coerce')

# Convert values != 1,2,3,4,5 to missing
????

# Sum
df["aut_sum"] = df.loc[:,df.columns.str.contains("aut_a_")].sum(axis=1)
df["aut_sum"]

Please feel free to improve my code.

Answer 1

Use filter to select columns (or index labels) by name:

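# '^aut_a_' anchors the match at the start of the column name
target = df.filter(regex='^aut_a_')
# Rows containing NaN are dropped; out-of-range values are replaced by 0 before summing
df['aut_sum'] = target.dropna().where(target.isin(np.arange(1, 6)), 0).sum(axis=1)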

Output:

     aut_a_p1_r1  aut_a_p1_r2  aut_a_p1_r3  aut_a_p1_r4  aut_sum
131         52.0          4.0          5.0          3.0     12.0
106          4.0          5.0          5.0          2.0     16.0
80           4.0          5.0          5.0          2.0     16.0
108          3.0          4.0          4.0          3.0     14.0
303          5.0          5.0          5.0          1.0     16.0
145          NaN          NaN          NaN          NaN      NaN
172          NaN          NaN          NaN          NaN      NaN
103          NaN          NaN          NaN          NaN      NaN
67           NaN          NaN          NaN          NaN      NaN
59           NaN          NaN          NaN          NaN      NaN
7            2.0          4.0          5.0          4.0     15.0
9            3.0          5.0          5.0          4.0     17.0
248          5.0          4.0          5.0          3.0     17.0
219          3.0          4.0          5.0          3.0     15.0
134          5.0          3.0          5.0          2.0     15.0
105          3.0          4.0          5.0          2.0     14.0
176          3.0          4.0          5.0          3.0     15.0
245          1.0          3.0          4.0          2.0     10.0
271          4.0          5.0          5.0          2.0     16.0
249          4.0          4.0          4.0          2.0     14.0
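
Note that the snippet above replaces out-of-range values with 0 rather than marking them as missing. A minimal sketch of a variant that stores them as NaN, as the question asks, and uses `min_count=1` so that an all-NaN row sums to NaN instead of 0. The small frame and the names `cols` and `valid` are illustrative assumptions, not part of the original post:

```python
import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical data, same column naming as the question)
df = pd.DataFrame(
    {
        "aut_a_p1_r1": [52.0, 4.0, np.nan],
        "aut_a_p1_r2": [4.0, 5.0, np.nan],
        "aut_a_p1_r3": [5.0, 5.0, np.nan],
    },
    index=[131, 106, 145],
)

cols = df.filter(like="aut_a_").columns
valid = [1, 2, 3, 4, 5]

# Keep values that are in the valid set; everything else becomes NaN
df[cols] = df[cols].where(df[cols].isin(valid))

# min_count=1: a row needs at least one non-NaN value, otherwise its sum is NaN
df["aut_sum"] = df[cols].sum(axis=1, min_count=1)
print(df)
```

With this approach the impossible value itself is overwritten in the data, which may or may not be what you want if the raw values need to be kept for auditing.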

Answer 2

Try:

df["aut_sum"] = (df.applymap(lambda x: x if x in [1,2,3,4,5] else np.nan)
.filter(like="aut_a_").dropna().sum(axis=1) )
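
One thing to be aware of: a plain `dropna()` after the replacement removes every row that has at least one NaN, so a row with a single impossible value (such as row 131) ends up with no sum at all. Also, as far as I know `applymap` is deprecated in recent pandas releases in favor of `DataFrame.map`, so the sketch below uses `where` with `isin` instead. The tiny frame and the column names `sum_strict` and `sum_lenient` are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame: one impossible value, one clean row, one all-NaN row
df = pd.DataFrame(
    {
        "aut_a_p1_r1": [52.0, 4.0, np.nan],
        "aut_a_p1_r2": [4.0, 5.0, np.nan],
    },
    index=[131, 106, 145],
)

cleaned = df.filter(like="aut_a_")
cleaned = cleaned.where(cleaned.isin([1, 2, 3, 4, 5]))  # impossible values -> NaN

# dropna() drops any row containing a NaN, including row 131
strict = cleaned.dropna().sum(axis=1)

# dropna(how="all") drops only rows that are entirely NaN
lenient = cleaned.dropna(how="all").sum(axis=1)

# Assignment aligns on the index; dropped rows receive NaN
df["sum_strict"] = strict
df["sum_lenient"] = lenient
print(df)
```

Which of the two behaviors is right depends on whether a partially valid row should still contribute a sum.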

Answer 3

Use mask:

df['aut_sum'] = df.dropna(how='all').filter(like='aut_a_').mask((df < 1) | (df > 5)).sum(axis=1)
print(df)

# Output
     aut_a_p1_r1  aut_a_p1_r2  aut_a_p1_r3  aut_a_p1_r4  aut_sum
131         52.0          4.0          5.0          3.0     12.0
106          4.0          5.0          5.0          2.0     16.0
80           4.0          5.0          5.0          2.0     16.0
108          3.0          4.0          4.0          3.0     14.0
303          5.0          5.0          5.0          1.0     16.0
145          NaN          NaN          NaN          NaN      NaN
172          NaN          NaN          NaN          NaN      NaN
103          NaN          NaN          NaN          NaN      NaN
67           NaN          NaN          NaN          NaN      NaN
59           NaN          NaN          NaN          NaN      NaN
7            2.0          4.0          5.0          4.0     15.0
9            3.0          5.0          5.0          4.0     17.0
248          5.0          4.0          5.0          3.0     17.0
219          3.0          4.0          5.0          3.0     15.0
134          5.0          3.0          5.0          2.0     15.0
105          3.0          4.0          5.0          2.0     14.0
176          3.0          4.0          5.0          3.0     15.0
245          1.0          3.0          4.0          2.0     10.0
271          4.0          5.0          5.0          2.0     16.0
249          4.0          4.0          4.0          2.0     14.0
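
A range check like `(df < 1) | (df > 5)` and a set check like `isin([1, 2, 3, 4, 5])` agree on this data, but they differ if a non-integer value such as 2.5 ever appears. A small sketch of the difference, using a made-up Series `s`:

```python
import pandas as pd

# Hypothetical values: two valid, one fractional, one impossible, one missing
s = pd.Series([1.0, 2.5, 5.0, 52.0, float("nan")])

# Range check: accepts anything between 1 and 5 inclusive, fractions included
in_range = s.between(1, 5)

# Set check: accepts only the listed values
in_set = s.isin([1, 2, 3, 4, 5])

print(pd.DataFrame({"value": s, "in_range": in_range, "in_set": in_set}))
```

If the question's definition (exactly 1, 2, 3, 4, or 5) matters, the set check is the stricter of the two.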
