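What is the best way to evaluate column values in a dataframe with multiple if/then conditions and conditionally modify the value of another column?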

Question

I have a dataframe (1580 rows x 48 columns) where each column contains the answers to one question, but not every row contains an answer to every question (missing answers are NaN). Groups of questions are related, and I'd like to tabulate the answers to each group of questions into new columns (c_answers and i_answers). I have generated lists of the correct answers for each group of questions. Here is an example of the data:

import numpy as np
import pandas as pd

ex_df = pd.DataFrame([["a", "b", "d"],[np.nan, "a", "b"], ["c", "e", np.nan]], columns=["q1", "q2", "q3"])
correct_answers = ["a", "b", "c"]
ex_df

which generates the following dataframe:

    q1   q2   q3
0   a    b    d
1  NaN   a    b
2   c    e   NaN

Ideally, I would like to create a function that scores each column: for each answer on a row that appears in the correct_answers list, it would increment a c_answers column by 1; for each answer that is not in correct_answers, it would increment an i_answers column by 1 instead; and if the answer is NaN, it would do neither (not counted as correct or incorrect). This function could then be applied to each group of questions, calculating the number of correct and incorrect answers on each row for that group.

What I have been able to make a bit of progress with instead is something like this:

ex_df['q1score'] = np.where(ex_df['q1'].isna(), np.nan, 
                          np.where(ex_df['q1'].isin(correct_answers), 1, 100))

which updates the dataframe like so:

    q1   q2   q3   q1score
0   a    b    d    1.0
1  NaN   a    b    NaN
2   c    e   NaN   1.0

I could then reuse this code to score q2 and q3 into their own new columns, sum those scores into another new column, and from that sum generate two more columns holding the number of correct and incorrect answers. Finally, I could go back and drop the 4 helper columns that I created and keep only the two that I wanted in the first place.
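
For reference, the roundabout workflow described above can be sketched roughly as follows. This is only an illustration, assuming the 1/100 encoding from the np.where snippet and fewer than 100 questions per group; the helper column name `total` is made up for this sketch.

```python
import numpy as np
import pandas as pd

ex_df = pd.DataFrame(
    [["a", "b", "d"], [np.nan, "a", "b"], ["c", "e", np.nan]],
    columns=["q1", "q2", "q3"],
)
correct_answers = ["a", "b", "c"]
questions = ["q1", "q2", "q3"]

# Score each question: NaN stays NaN, correct -> 1, incorrect -> 100.
for q in questions:
    ex_df[q + "score"] = np.where(
        ex_df[q].isna(),
        np.nan,
        np.where(ex_df[q].isin(correct_answers), 1, 100),
    )

score_cols = [q + "score" for q in questions]

# Sum the scores per row; sum() skips NaN by default.
# "total" is a hypothetical helper column name.
ex_df["total"] = ex_df[score_cols].sum(axis=1).astype(int)

# Decode the sum: the remainder is the correct count,
# the number of hundreds is the incorrect count.
ex_df["c_answers"] = ex_df["total"] % 100
ex_df["i_answers"] = ex_df["total"] // 100

# Drop the 4 helper columns again.
ex_df = ex_df.drop(columns=score_cols + ["total"])
print(ex_df)
```

It works, but the decoding step only holds while a group has fewer than 100 questions, and the whole sequence has to be repeated for every group.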

After looking around and trying different methods for the last two hours, I've found a lot of answers that deal with one or another of the issues I'm facing, but nothing that I could adapt to actually work for my situation. Maybe the solution I've kludged together is the best one, but I'm still relatively new to programming (<18 months), and it doesn't seem like the most efficient or most Pythonic way to solve this problem. I'm hoping someone else has a better answer. Thank you!

Edit for more information regarding output: I'd like the final output to look like this:

    q1   q2   q3   c_answers  i_answers
0   a    b    d    2          1
1  NaN   a    b    2          0
2   c    e   NaN   1          1

Like I said, I can more or less get there by using the nested np.where() to create numeric columns that I can then sum up and reverse engineer into raw counts. While this is a solution, it's cumbersome and probably not the optimal one, especially given the amount of repetition involved (I'll have to repeat this process for 9 different groups of columns, each being a cluster of questions).

Answer 1

Build one boolean mask for correct answers and one for incorrect answers, then use sum to count the True values per row:

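# True where the answer is one of the correct answers (NaN never matches)
m1 = ex_df.isin(correct_answers)
# True where an answer was given but is not correct
m2 = ex_df.notna() & ~m1

# Count the True values per row and add both counts as new columns
df = ex_df.assign(c_answers=m1.sum(axis=1), i_answers=m2.sum(axis=1))
print(df)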
    q1 q2   q3  c_answers  i_answers
0    a  b    d          2          1
1  NaN  a    b          2          0
2    c  e  NaN          1          1
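
Since the question asks for a reusable function, the two masks can be wrapped in one. This is a minimal sketch rather than part of the original answer; the name `score_answers` and its parameters are made up for illustration.

```python
import numpy as np
import pandas as pd


def score_answers(df, columns, correct):
    """Return per-row counts of correct and incorrect answers (NaN ignored).

    Hypothetical helper: `columns` lists the question columns of one group,
    `correct` lists the answers counted as correct for that group.
    """
    answers = df[columns]
    is_correct = answers.isin(correct)            # NaN never matches
    is_incorrect = answers.notna() & ~is_correct  # answered, but wrong
    return pd.DataFrame(
        {
            "c_answers": is_correct.sum(axis=1),
            "i_answers": is_incorrect.sum(axis=1),
        }
    )


ex_df = pd.DataFrame(
    [["a", "b", "d"], [np.nan, "a", "b"], ["c", "e", np.nan]],
    columns=["q1", "q2", "q3"],
)

scores = score_answers(ex_df, ["q1", "q2", "q3"], ["a", "b", "c"])
result = ex_df.join(scores)  # aligned on the row index
print(result)
```

Returning a separate frame of counts leaves the input untouched, so the caller decides whether to join the counts back or rename them per group.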

Possible solution for multiple groups:

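# Map each group name to the question columns it covers
groups = {'g1':['q1','q2'], 'g2':['q2','q3'], 'g3':['q1','q2','q3']}

for k, v in groups.items():
    # Correct / incorrect masks restricted to this group's columns
    m1 = ex_df[v].isin(correct_answers)
    m2 = ex_df[v].notna() & ~m1

    # Add one pair of count columns per group
    ex_df = ex_df.assign(**{f'c_answers_{k}': m1.sum(axis=1),
                            f'i_answers_{k}': m2.sum(axis=1)})
print(ex_df)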
    q1 q2   q3  c_answers_g1  i_answers_g1  c_answers_g2  i_answers_g2  \
0    a  b    d             2             0             1             1   
1  NaN  a    b             1             0             2             0   
2    c  e  NaN             1             1             0             1   

   c_answers_g3  i_answers_g3  
0             2             1  
1             2             0  
2             1             1
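
The question mentions a separate list of correct answers for each group, whereas the loop above reuses a single correct_answers list. One possible variation, sketched here under the assumption that each group carries its own answer key, stores the columns and the key together. The group names and answer keys below are invented for illustration.

```python
import numpy as np
import pandas as pd

ex_df = pd.DataFrame(
    [["a", "b", "d"], [np.nan, "a", "b"], ["c", "e", np.nan]],
    columns=["q1", "q2", "q3"],
)

# Hypothetical layout: group name -> (question columns, correct answers).
groups = {
    "g1": (["q1", "q2"], ["a", "b"]),
    "g2": (["q3"], ["d"]),
}

new_cols = {}
for name, (cols, correct) in groups.items():
    answers = ex_df[cols]
    is_correct = answers.isin(correct)            # NaN never matches
    is_incorrect = answers.notna() & ~is_correct  # answered, but wrong
    new_cols[f"c_answers_{name}"] = is_correct.sum(axis=1)
    new_cols[f"i_answers_{name}"] = is_incorrect.sum(axis=1)

# Attach all count columns in one step.
scored = ex_df.assign(**new_cols)
print(scored)
```

Collecting the new columns in a dict and assigning them once keeps the original ex_df unchanged while the loop runs.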

Solution 1
1 2022-06-02 05:27:14