
Remove specific duplicates from df/list of lists

I have the following pandas df (dummy df; the original has around 50,000 rows).

import pandas as pd

columns = ['question_id', 'answer', 'is_correct']
data = [['1', 'hello', '1.0'],
        ['1', 'hello', '1.0'],
        ['1', 'hello', '1.0'],
        ['2', 'dog', '0.0'],
        ['2', 'cat', '1.0'],
        ['2', 'dog', '0.0'],
        ['2', 'the answer is cat', '1.0'],
        ['3', 'Milan', '1.0'],
        ['3', 'Paris', '0.0'],
        ['3', 'The capital is Paris', '0.0'],
        ['3', 'MILAN', '1.0'],
        ['4', 'The capital is Paris', '1.0'],
        ['4', 'London', '0.0'],
        ['4', 'Paris', '1.0'],
        ['4', 'paris', '1.0'],
        ['5', 'lol', '0.0'],
        ['5', 'rofl', '0.0'],
        ['6', '5.5', '1.0'],
        ['6', '5.2', '0.0']]
df = pd.DataFrame(columns=columns, data=data)

I want to return a list of lists. Each inner list should contain exactly two correct answers (is_correct = 1.0; a1_correct and a2_correct) and one incorrect answer (is_correct = 0.0; a_incorrect) from the same question. Important: if a1_correct equals a2_correct, skip that question; I do not want a1_correct and a2_correct to be duplicates of each other. There should be one inner list per question_id. Any other answers within a question_id can simply be ignored.

Edge cases:

  • All answers are correct -> Skip this question.
  • All correct answers are duplicates -> Skip this question.
  • No answer is correct -> Skip this question, e.g. output None. See question_id = 5.
  • Only one answer is correct -> Skip this question, e.g. output None. See question_id = 6.

What I want the output to look like:

[['cat', 'the answer is cat', 'dog'], ['Milan', 'MILAN', 'Paris'], ['The capital is Paris', 'paris', 'London']]

My current approach includes the duplicates. How can I fix that? Should I first remove the duplicates from the df and then create the list of lists?

import builtins

def create_triplet(grp):
    is_correct = grp['is_correct'] == 1.0
    is_wrong = grp['is_correct'] == 0.0
    if (is_correct.value_counts().get(True, 0) >= 2) and is_wrong.any():
      a1_correct = grp['answer'][is_correct].iloc[0]
      a2_correct = grp['answer'][is_correct].iloc[1]
      #here I tried to ignore duplicates but it doesn't work
      if a1_correct == a2_correct:
        return
      else: grp['answer'][is_correct].iloc[1]
      incorrect = grp['answer'][is_wrong].iloc[0]
      return [a1_correct, a2_correct, incorrect]

triplets_raw = df.groupby('question_id').apply(create_triplet)
triplets_list = list(builtins.filter(lambda x: (x is not None), triplets_raw.to_list()))

Since you don't want duplicates among the correct answers, call drop_duplicates() on them before selecting the 2 correct answers. Any 2 answers selected from the result will then be unique. After that, select (up to) 2 of them, and do the same for the wrong answers (up to 1).
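To illustrate the step above in isolation, here is a minimal sketch using the correct answers of question_id 1 and 2 from the sample data. The variable names (q1_correct, q2_correct, and so on) are made up for this illustration.

```python
import pandas as pd

# Correct answers of question_id 1 and 2, copied from the sample data
q1_correct = pd.Series(['hello', 'hello', 'hello'])
q2_correct = pd.Series(['cat', 'the answer is cat'])

# drop_duplicates() keeps the first occurrence of each value,
# iloc[:2] then takes at most two of the remaining answers
q1_unique = q1_correct.drop_duplicates().iloc[:2].tolist()
q2_unique = q2_correct.drop_duplicates().iloc[:2].tolist()

print(q1_unique)  # ['hello']
print(q2_unique)  # ['cat', 'the answer is cat']
```

Question 1 ends up with a single unique correct answer, which is what later allows it to be skipped.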

After selecting the correct and wrong answers, if I understood correctly, create_triplet should only return something when there are 2 correct answers and 1 wrong answer to return. Checking the lengths with len() works fine for this.
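As a rough sketch of that length check, here is the same rule on plain Python lists, run against the edge cases from the question. pick_triplet is a hypothetical helper written for this illustration, not part of pandas or of the code below.

```python
def pick_triplet(correct, wrong):
    """Return [a1_correct, a2_correct, a_incorrect], or None to skip the question."""
    # dict.fromkeys() keeps the first occurrence of each answer, in order
    unique_correct = list(dict.fromkeys(correct))[:2]
    first_wrong = wrong[:1]
    if len(unique_correct) == 2 and len(first_wrong) == 1:
        return unique_correct + first_wrong
    return None

# All answers correct and all of them duplicates (question_id 1)
print(pick_triplet(['hello', 'hello', 'hello'], []))  # None
# No correct answer (question_id 5)
print(pick_triplet([], ['lol', 'rofl']))  # None
# Only one correct answer (question_id 6)
print(pick_triplet(['5.5'], ['5.2']))  # None
# Two different correct answers and at least one wrong one (question_id 2)
print(pick_triplet(['cat', 'the answer is cat'], ['dog', 'dog']))
# ['cat', 'the answer is cat', 'dog']
```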

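I modified the code you provided a little, and it produces the expected output with one difference: for question_id 4 it returns 'Paris' rather than 'paris' as the second correct answer, because it takes the first two unique correct answers in row order.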

There are also some comments in the code, and sample outputs after it, to clarify what the code does.

import pandas as pd

def create_triplet(grp):
    # Select unique, correct answers
    correct = grp.loc[grp['is_correct'] == '1.0', 'answer'].drop_duplicates()
    # Select up to 2 correct answers and change to a list
    correct = list(correct.iloc[:2])
    # Repeat similarly for wrong answers, except only take up to 1 wrong answer
    # The same thing in one line
    # May or may not be easier to read, use whichever you prefer
    # Note: drop_duplicates is not necessary here
    wrong = list(grp.loc[grp['is_correct'] == '0.0', 'answer'].drop_duplicates().iloc[:1])
    # Question should not be skipped when there are (at least)
    # 2 different but correct answers and 1 wrong answer
    if len(correct) == 2 and len(wrong) == 1:
        return correct + wrong
    # Otherwise signify skipping the question by returning None
    return None


columns = ['question_id', 'answer', 'is_correct']
data = [
    ['1', 'hello', '1.0'],
    ['1', 'hello', '1.0'],
    ['1', 'hello', '1.0'],
    ['2', 'dog', '0.0'],
    ['2', 'cat', '1.0'],
    ['2', 'dog', '0.0'],
    ['2', 'the answer is cat', '1.0'],
    ['3', 'Milan', '1.0'],
    ['3', 'Paris', '0.0'],
    ['3', 'The capital is Paris', '0.0'],
    ['3', 'MILAN', '1.0'],
    ['4', 'The capital is Paris', '1.0'],
    ['4', 'London', '0.0'],
    ['4', 'Paris', '1.0'],
    ['4', 'paris', '1.0'],
    ['5', 'lol', '0.0'],
    ['5', 'rofl', '0.0'],
    ['6', '5.5', '1.0'],
    ['6', '5.2', '0.0']
]
df = pd.DataFrame(columns=columns, data=data)
expected = [
    ['cat', 'the answer is cat', 'dog'],
    ['Milan', 'MILAN', 'Paris'],
    ['The capital is Paris', 'paris', 'London']
]

triplets_raw = df.groupby('question_id').apply(create_triplet)
# triplets_raw is a pandas Series whose values are either
# a list of valid responses or None
# dropna() removes rows with None values, leaving only rows with lists
# The resulting Series is then converted to a list as required
triplets_list = list(triplets_raw.dropna())

Some outputs:

>>> df.groupby('question_id').apply(create_triplet)
question_id
1                                     None
2            [cat, the answer is cat, dog]
3                    [Milan, MILAN, Paris]
4    [The capital is Paris, Paris, London]
5                                     None
6                                     None
>>> triplets_raw = df.groupby('question_id').apply(create_triplet)
>>> list(triplets_raw.dropna())
[['cat', 'the answer is cat', 'dog'], ['Milan', 'MILAN', 'Paris'], ['The capital is Paris', 'Paris', 'London']]
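One thing to check: the sample data stores is_correct as strings ('1.0'), while the code in the question compares against the float 1.0. If the real DataFrame holds floats, the comparison with '1.0' above would match nothing. The following is only a sketch that converts the column first, assuming every is_correct value parses as a number. build_triplets, df_strings and df_floats are names invented for this illustration, and the loop over the groups is just an alternative to groupby().apply().

```python
import pandas as pd

def create_triplet(grp):
    # Assumption: every is_correct value parses as a number ('1.0' or 1.0)
    flags = grp['is_correct'].astype(float)
    correct = grp.loc[flags == 1.0, 'answer'].drop_duplicates().iloc[:2].tolist()
    wrong = grp.loc[flags == 0.0, 'answer'].iloc[:1].tolist()
    if len(correct) == 2 and len(wrong) == 1:
        return correct + wrong
    return None

def build_triplets(df):
    # Iterate over the groups directly and keep only the non-skipped questions
    triplets = []
    for _, grp in df.groupby('question_id'):
        triplet = create_triplet(grp)
        if triplet is not None:
            triplets.append(triplet)
    return triplets

columns = ['question_id', 'answer', 'is_correct']
rows = [
    ['2', 'dog', '0.0'],
    ['2', 'cat', '1.0'],
    ['2', 'dog', '0.0'],
    ['2', 'the answer is cat', '1.0'],
    ['6', '5.5', '1.0'],
    ['6', '5.2', '0.0'],
]
df_strings = pd.DataFrame(columns=columns, data=rows)
# Same rows, but with is_correct stored as floats
df_floats = df_strings.assign(is_correct=df_strings['is_correct'].astype(float))

print(build_triplets(df_strings))  # [['cat', 'the answer is cat', 'dog']]
print(build_triplets(df_floats))   # [['cat', 'the answer is cat', 'dog']]
```

Both variants give the same result because the comparison happens after the conversion to float.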
