What is the optimal method for multiple if/then's to evaluate a column's value in a dataframe and conditionally modify another column value?
I have a dataframe (1580 rows x 48 columns) where each column contains answers to questions, but not every row contains an answer to every question (leaving it NaN). Groups of questions are related, and I'd like to tabulate the answers to each group of questions into new columns (c_answers and i_answers). I have generated lists of the correct answers for each group of questions. Here is an example of the data:
import numpy as np
import pandas as pd

ex_df = pd.DataFrame([["a", "b", "d"], [np.nan, "a", "b"], ["c", "e", np.nan]], columns=["q1", "q2", "q3"])
correct_answers = ["a", "b", "c"]
ex_df
which generates the following dataframe:
    q1 q2   q3
0    a  b    d
1  NaN  a    b
2    c  e  NaN
What I would like to do, ideally, is to create a function that scores each column. For each answer on a row that is correct (appears in the correct_answers list), it would increment a c_answers column by 1; for each answer that is not in correct_answers, it would increment an i_answers column by 1 instead; and if the provided answer is NaN, it would do neither (not counted as correct or incorrect). This function could then be applied to each group of questions, calculating the number of correct and incorrect answers for each row, for that group.
What I have been able to make a bit of progress with instead is something like this:
ex_df['q1score'] = np.where(ex_df['q1'].isna(), np.nan,
                            np.where(ex_df['q1'].isin(correct_answers), 1, 100))
which updates the dataframe like so:
    q1 q2   q3  q1score
0    a  b    d      1.0
1  NaN  a    b      NaN
2    c  e  NaN      1.0
I could then reuse this code to score q2 and q3 into their own new columns and sum the three score columns into a new column. From that sum, I could generate two more columns holding the number of correct and incorrect answers. Finally, I could go back and drop the other 4 columns that I created and keep only the two that I wanted in the first place.
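For reference, the sum-and-decode workaround described above could be sketched roughly as follows. This is only an illustration of the arithmetic, not a recommendation; the names question_cols, scores, total and result are made up for the example, and it assumes a group never has 100 or more questions.

```python
import numpy as np
import pandas as pd

ex_df = pd.DataFrame(
    [["a", "b", "d"], [np.nan, "a", "b"], ["c", "e", np.nan]],
    columns=["q1", "q2", "q3"],
)
correct_answers = ["a", "b", "c"]
question_cols = ["q1", "q2", "q3"]  # hypothetical name for one group of questions

# Score every question column: NaN stays NaN, correct -> 1, incorrect -> 100.
scores = pd.DataFrame(
    {
        col: np.where(
            ex_df[col].isna(),
            np.nan,
            np.where(ex_df[col].isin(correct_answers), 1, 100),
        )
        for col in question_cols
    },
    index=ex_df.index,
)

# Row totals; sum() skips NaN by default, so unanswered questions add nothing.
total = scores.sum(axis=1).astype(int)

# Decode the total: the remainder counts correct answers, the quotient incorrect ones.
# This breaks down once a group has 100 or more questions.
result = ex_df.assign(c_answers=total % 100, i_answers=total // 100)
print(result)
```

The intermediate scores frame is never attached to ex_df here, so there are no helper columns to drop afterwards.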
Looking around and trying different methods for the last two hours, I'm finding a lot of answers that deal with one or another of the different issues I'm trying to deal with, but nothing that I could finagle to actually work for my situation. Maybe the solution I've kludged together is the best one, but I'm still relatively new to programming (<18 months) and it didn't seem like the most efficient or most Pythonic method to solve this problem. Hoping someone else has a better answer out there. Thank you!
Edit for more information regarding output: I'd like the final output to look something like this:
    q1 q2   q3  c_answers  i_answers
0    a  b    d          2          1
1  NaN  a    b          2          0
2    c  e  NaN          1          1
Like I said, I can kind of finagle that using the nested np.where() to create numeric columns that I can then sum up and reverse engineer to get a raw count from. While this is a solution, it's cumbersome and probably not the optimal one, especially with the amount of repetition involved (I'll have to do this process for 9 different groups of columns, each being a cluster of questions).
Use sum to count the True values in boolean masks of the correct and incorrect answers per row:
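# True where the answer is in correct_answers (NaN is not in the list, so it is False)
m1 = ex_df.isin(correct_answers)
# True where an answer was given but is not a correct one
m2 = ex_df.notna() & ~m1
df = ex_df.assign(c_answers=m1.sum(axis=1), i_answers=m2.sum(axis=1))
print(df)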
    q1 q2   q3  c_answers  i_answers
0    a  b    d          2          1
1  NaN  a    b          2          0
2    c  e  NaN          1          1
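Since the question asks for a function, the two masks can be wrapped in a small helper. This is just a sketch; the name score_answers and its parameters are my own choice for illustration, not anything built into pandas.

```python
import numpy as np
import pandas as pd


def score_answers(df, cols, correct):
    """Return per-row counts of correct and incorrect answers in `cols`."""
    answers = df[cols]
    # Missing answers are False here as long as `correct` holds no NaN.
    is_correct = answers.isin(correct)
    # Answered, but not with one of the correct answers.
    is_incorrect = answers.notna() & ~is_correct
    return pd.DataFrame(
        {
            "c_answers": is_correct.sum(axis=1),
            "i_answers": is_incorrect.sum(axis=1),
        }
    )


ex_df = pd.DataFrame(
    [["a", "b", "d"], [np.nan, "a", "b"], ["c", "e", np.nan]],
    columns=["q1", "q2", "q3"],
)
counts = score_answers(ex_df, ["q1", "q2", "q3"], ["a", "b", "c"])
print(ex_df.join(counts))
```

Because the helper returns a separate frame indexed like the input, it can be joined back or renamed per group without touching the question columns.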
Possible solution for multiple groups:
groups = {'g1': ['q1', 'q2'], 'g2': ['q2', 'q3'], 'g3': ['q1', 'q2', 'q3']}
for k, v in groups.items():
    m1 = ex_df[v].isin(correct_answers)
    m2 = ex_df[v].notna() & ~m1
    ex_df = ex_df.assign(**{f'c_answers_{k}': m1.sum(axis=1),
                            f'i_answers_{k}': m2.sum(axis=1)})
print(ex_df)
    q1 q2   q3  c_answers_g1  i_answers_g1  c_answers_g2  i_answers_g2  \
0    a  b    d             2             0             1             1
1  NaN  a    b             1             0             2             0
2    c  e  NaN             1             1             0             1

   c_answers_g3  i_answers_g3
0             2             1
1             2             0
2             1             1
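The question mentions a separate list of correct answers for each group. If that is the case, one possible variation is to store the columns and the answer list together per group. The group layout and the second answer list below are invented purely for illustration.

```python
import numpy as np
import pandas as pd

ex_df = pd.DataFrame(
    [["a", "b", "d"], [np.nan, "a", "b"], ["c", "e", np.nan]],
    columns=["q1", "q2", "q3"],
)

# Hypothetical layout: group name -> (question columns, correct answers for that group).
groups = {
    "g1": (["q1", "q2"], ["a", "b", "c"]),
    "g2": (["q3"], ["b", "d"]),
}

new_cols = {}
for name, (cols, correct) in groups.items():
    is_correct = ex_df[cols].isin(correct)
    is_incorrect = ex_df[cols].notna() & ~is_correct
    new_cols[f"c_answers_{name}"] = is_correct.sum(axis=1)
    new_cols[f"i_answers_{name}"] = is_incorrect.sum(axis=1)

# Add all count columns in one step, leaving ex_df itself unchanged.
scored = ex_df.assign(**new_cols)
print(scored)
```

Collecting the new columns in a dict and calling assign once keeps the loop from reassigning ex_df on every pass.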