
What is the optimal method for multiple if/then's to evaluate a column's value in a dataframe and conditionally modify another column value?

I have a dataframe (1580 rows x 48 columns) where each column contains the answers to one question, but not every row contains an answer to every question (missing answers are NaN). Groups of questions are related, and I'd like to tabulate the answers to each group of questions into new columns (c_answers and i_answers). I have generated lists of the correct answers for each group of questions. Here is an example of the data:

import numpy as np
import pandas as pd

ex_df = pd.DataFrame([["a", "b", "d"],[np.nan, "a", "b"], ["c", "e", np.nan]], columns=["q1", "q2", "q3"])
correct_answers = ["a", "b", "c"]
ex_df

which generates the following dataframe:

    q1   q2   q3
0   a    b    d
1  NaN   a    b
2   c    e   NaN

Ideally, I would like a function that scores each column: for each answer on a row that appears in the correct_answers list, it increments a c_answers column by 1; for each answer that is not in correct_answers, it increments an i_answers column by 1 instead; and if the answer is NaN, it does neither (not counted as correct or incorrect). This function could then be applied to each group of questions, calculating the number of correct and incorrect answers per row for that group.

What I have been able to make some progress with instead is something like this:

ex_df['q1score'] = np.where(ex_df['q1'].isna(), np.nan, 
                          np.where(ex_df['q1'].isin(correct_answers), 1, 100))

which updates the dataframe like so:

    q1   q2   q3   q1score
0   a    b    d    1.0
1  NaN   a    b    NaN
2   c    e   NaN   1.0

I could then re-use this code to score q2 and q3 into their own new columns, sum those score columns into another new column, and from that sum derive two more columns holding the number of correct and incorrect answers. Finally, I could go back and drop the other 4 columns that I created and keep only the two that I wanted in the first place.
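For reference, the workaround described above can be sketched roughly as follows. This is only an illustration of the idea, not code from the original post: the names cols, coded, total and decoded are made up here, and decoding the sum with % and // assumes a group never has 100 or more questions.

```python
import numpy as np
import pandas as pd

ex_df = pd.DataFrame(
    [["a", "b", "d"], [np.nan, "a", "b"], ["c", "e", np.nan]],
    columns=["q1", "q2", "q3"],
)
correct_answers = ["a", "b", "c"]
cols = ["q1", "q2", "q3"]  # hypothetical: the columns of one question group

# Code every cell: 1 = correct, 100 = incorrect, NaN = unanswered.
coded = pd.DataFrame(
    np.where(
        ex_df[cols].isna(),
        np.nan,
        np.where(ex_df[cols].isin(correct_answers), 1, 100),
    ),
    index=ex_df.index,
    columns=cols,
)

# Row sums skip NaN by default, so unanswered cells contribute nothing.
total = coded.sum(axis=1).astype(int)

# Decode the sum: the hundreds count incorrect answers, the remainder counts correct ones.
decoded = pd.DataFrame({"c_answers": total % 100, "i_answers": total // 100})
print(ex_df.join(decoded))
```

It works, but the encoding step, the intermediate columns and the decoding step all have to be repeated for every group.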

After looking around and trying different methods for the last two hours, I'm finding a lot of answers that deal with one or another of the issues I'm facing, but nothing that I could adapt to actually work for my situation. Maybe the solution I've kludged together is the best one, but I'm still relatively new to programming (<18 months) and it doesn't seem like the most efficient or most Pythonic way to solve this problem. Hoping someone else has a better answer out there. Thank you!

Edit for more information regarding output: I'd like the final output to look like this:

    q1   q2   q3   c_answers  i_answers
0   a    b    d    2          1
1  NaN   a    b    2          0
2   c    e   NaN   1          1

Like I said, I can kind of finagle that by using the nested np.where() to create numeric columns that I can then sum up and reverse engineer to get a raw count from. While this is a solution, it's cumbersome and probably not the optimal one, especially given the amount of repetition involved (I'll have to repeat this process for 9 different groups of columns, each being a cluster of questions).

Build one boolean mask for correct answers and one for incorrect answers, then use sum to count the True values in each row:

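# True where the answer is in the list of correct answers (NaN gives False)
m1 = ex_df.isin(correct_answers)
# True where an answer was given but is not correct
m2 = ex_df.notna() & ~m1

# Summing a boolean mask along axis=1 counts the True values per row
df = ex_df.assign(c_answers=m1.sum(axis=1), i_answers=m2.sum(axis=1))
print(df)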
    q1 q2   q3  c_answers  i_answers
0    a  b    d          2          1
1  NaN  a    b          2          0
2    c  e  NaN          1          1
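Since the question asks for a function, the same two masks could be wrapped in a small helper. This is a sketch rather than part of the original answer: score_group and its parameters are hypothetical names, and it assumes the list of correct answers does not itself contain NaN.

```python
import numpy as np
import pandas as pd


def score_group(df, cols, correct):
    """Return per-row counts of correct and incorrect answers for the columns in cols."""
    answers = df[cols]
    is_correct = answers.isin(correct)            # NaN is never matched by the list
    is_incorrect = answers.notna() & ~is_correct  # answered, but not in the list
    return pd.DataFrame({
        "c_answers": is_correct.sum(axis=1),
        "i_answers": is_incorrect.sum(axis=1),
    })


ex_df = pd.DataFrame(
    [["a", "b", "d"], [np.nan, "a", "b"], ["c", "e", np.nan]],
    columns=["q1", "q2", "q3"],
)
scores = score_group(ex_df, ["q1", "q2", "q3"], ["a", "b", "c"])

# Sanity check: every answered cell is counted exactly once.
answered = ex_df.notna().sum(axis=1)
assert (scores["c_answers"] + scores["i_answers"]).eq(answered).all()

result = ex_df.join(scores)
print(result)
```

The sanity check is optional, but it is a cheap way to confirm that NaN cells are excluded from both counts.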

Possible solution for multiple groups:

groups = {'g1':['q1','q2'], 'g2':['q2','q3'], 'g3':['q1','q2','q3']}

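# k is the group name, v is the list of question columns in that group
for k, v in groups.items():
    m1 = ex_df[v].isin(correct_answers)
    m2 = ex_df[v].notna() & ~m1

    # Add one pair of count columns per group, suffixed with the group name
    ex_df = ex_df.assign(**{f'c_answers_{k}':m1.sum(axis=1), 
                            f'i_answers_{k}':m2.sum(axis=1)})
print(ex_df)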
    q1 q2   q3  c_answers_g1  i_answers_g1  c_answers_g2  i_answers_g2  \
0    a  b    d             2             0             1             1   
1  NaN  a    b             1             0             2             0   
2    c  e  NaN             1             1             0             1   

   c_answers_g3  i_answers_g3  
0             2             1  
1             2             0  
2             1             1  
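The question mentions a separate list of correct answers for each group, while the loop above checks every group against the same correct_answers list. One way this might be adapted is to store the columns and the correct answers together per group. The group names, column splits and answer lists below are invented purely for illustration:

```python
import numpy as np
import pandas as pd

ex_df = pd.DataFrame(
    [["a", "b", "d"], [np.nan, "a", "b"], ["c", "e", np.nan]],
    columns=["q1", "q2", "q3"],
)

# Hypothetical layout: group name -> (question columns, correct answers for that group)
groups = {
    "g1": (["q1", "q2"], ["a", "b"]),
    "g2": (["q3"], ["d"]),
}

parts = []
for name, (cols, correct) in groups.items():
    answers = ex_df[cols]
    is_correct = answers.isin(correct)
    is_incorrect = answers.notna() & ~is_correct
    parts.append(is_correct.sum(axis=1).rename(f"c_answers_{name}"))
    parts.append(is_incorrect.sum(axis=1).rename(f"i_answers_{name}"))

# Attach all count columns in one step instead of reassigning the frame in the loop.
scored = pd.concat([ex_df, *parts], axis=1)
print(scored)
```

Collecting the Series first and concatenating once also keeps the original ex_df unchanged, which makes it easier to re-run the scoring with different groups.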
