
Remove specific duplicates from df/list of lists

I have the following pandas df (this is a dummy df; the original has around 50,000 rows).

import pandas as pd

columns = ['question_id', 'answer', 'is_correct']
data = [['1', 'hello', '1.0'],
        ['1', 'hello', '1.0'],
        ['1', 'hello', '1.0'],
        ['2', 'dog', '0.0'],
        ['2', 'cat', '1.0'],
        ['2', 'dog', '0.0'],
        ['2', 'the answer is cat', '1.0'],
        ['3', 'Milan', '1.0'],
        ['3', 'Paris', '0.0'],
        ['3', 'The capital is Paris', '0.0'],
        ['3', 'MILAN', '1.0'],
        ['4', 'The capital is Paris', '1.0'],
        ['4', 'London', '0.0'],
        ['4', 'Paris', '1.0'],
        ['4', 'paris', '1.0'],
        ['5', 'lol', '0.0'],
        ['5', 'rofl', '0.0'],
        ['6', '5.5', '1.0'],
        ['6', '5.2', '0.0']]
df = pd.DataFrame(columns=columns, data=data)

I want to return a list of lists. Each inner list should contain exactly two correct answers (is_correct = 1.0; a1_correct and a2_correct) and one incorrect answer (is_correct = 0.0; a_incorrect) from the same question. Important: if a1_correct equals a2_correct, skip that question; I do not want a1_correct and a2_correct to be duplicates of each other. There should be one inner list per question_id. Any other answers within a question_id can simply be ignored.

Edge cases:

  • All answers are correct -> Skip this question
  • All correct answers are duplicates -> Skip this question
  • No answer is correct -> Skip this question, e.g. output None. See question_id = 5
  • Only one answer is correct -> Skip this question, e.g. output None. See question_id = 6

What I want the output to look like:

[['cat', 'the answer is cat', 'dog'], ['Milan', 'MILAN', 'Paris'], ['The capital is Paris', 'paris', 'London']]

My current approach includes the duplicates. How can I fix that? Should I first remove the duplicates from the df and then create the list of lists?

import builtins

def create_triplet(grp):
    is_correct = grp['is_correct'] == 1.0
    is_wrong = grp['is_correct'] == 0.0
    if (is_correct.value_counts().get(True, 0) >= 2) and is_wrong.any():
      a1_correct = grp['answer'][is_correct].iloc[0]
      a2_correct = grp['answer'][is_correct].iloc[1]
      #here I tried to ignore duplicates but it doesn't work
      if a1_correct == a2_correct:
        return
      else: grp['answer'][is_correct].iloc[1]
      incorrect = grp['answer'][is_wrong].iloc[0]
      return [a1_correct, a2_correct, incorrect]

triplets_raw = df.groupby('question_id').apply(create_triplet)
triplets_list = list(builtins.filter(lambda x: (x is not None), triplets_raw.to_list()))

Since you don't want duplicates among the correct answers, call drop_duplicates() on the correct answers before selecting two of them. Any two answers taken from the de-duplicated result are guaranteed to differ. Then take up to 2 of those correct answers, and in the same way take up to 1 of the wrong answers.

After selecting the correct and wrong answers, if I understood correctly, create_triplet should only return something when there are 2 correct answers and 1 wrong answer to return. Checking the lengths with len() is enough for this.
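
As a small illustration of those two steps, here is a sketch using only the correct answers of question_id 1 and question_id 2 from the dummy data. The variable names are made up for this example.

```python
import pandas as pd

# Correct answers of question_id 1 and question_id 2 from the dummy data
q1_correct = pd.Series(['hello', 'hello', 'hello'])
q2_correct = pd.Series(['cat', 'the answer is cat'])

# drop_duplicates() keeps the first occurrence of each distinct value;
# iloc[:2] then takes at most two of the remaining answers
q1_unique = list(q1_correct.drop_duplicates().iloc[:2])
q2_unique = list(q2_correct.drop_duplicates().iloc[:2])

print(len(q1_unique))  # 1 -> fewer than 2 distinct correct answers, skip
print(len(q2_unique))  # 2 -> enough distinct correct answers
```

Because iloc[:2] never raises when fewer than two rows are left, the length check is what decides whether the question is skipped.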

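I modified the code you provided a little, and it now produces the expected output, with one difference: for question_id 4 it returns 'Paris' rather than 'paris' as the second correct answer, because it takes the first two distinct correct answers in row order.

There are also some comments in the code, and sample outputs after the code, to clarify what it does.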

import pandas as pd

def create_triplet(grp):
    # Select unique, correct answers
    correct = grp.loc[grp['is_correct'] == '1.0', 'answer'].drop_duplicates()
    # Select up to 2 correct answers and change to a list
    correct = list(correct.iloc[:2])
    # Do the same for the wrong answers, except take at most 1 wrong answer
    # Here the same steps are written in one line
    # May or may not be easier to read, use whichever you prefer
    # Note: drop_duplicates is not necessary here since only 1 answer is taken
    wrong = list(grp.loc[grp['is_correct'] == '0.0', 'answer'].drop_duplicates().iloc[:1])
    # Question should not be skipped when there are (at least)
    # 2 different but correct answers and 1 wrong answer
    if len(correct) == 2 and len(wrong) == 1:
        return correct + wrong
    # Otherwise signify skipping the question by returning None
    return None


columns = ['question_id', 'answer', 'is_correct']
data = [
    ['1', 'hello', '1.0'],
    ['1', 'hello', '1.0'],
    ['1', 'hello', '1.0'],
    ['2', 'dog', '0.0'],
    ['2', 'cat', '1.0'],
    ['2', 'dog', '0.0'],
    ['2', 'the answer is cat', '1.0'],
    ['3', 'Milan', '1.0'],
    ['3', 'Paris', '0.0'],
    ['3', 'The capital is Paris', '0.0'],
    ['3', 'MILAN', '1.0'],
    ['4', 'The capital is Paris', '1.0'],
    ['4', 'London', '0.0'],
    ['4', 'Paris', '1.0'],
    ['4', 'paris', '1.0'],
    ['5', 'lol', '0.0'],
    ['5', 'rofl', '0.0'],
    ['6', '5.5', '1.0'],
    ['6', '5.2', '0.0']
]
df = pd.DataFrame(columns=columns, data=data)
expected = [
    ['cat', 'the answer is cat', 'dog'],
    ['Milan', 'MILAN', 'Paris'],
    ['The capital is Paris', 'paris', 'London']
]

triplets_raw = df.groupby('question_id').apply(create_triplet)
# triplets_raw is a pandas Series whose values are either
# a list of valid responses or None
# dropna() removes rows with None values, leaving only rows with lists
# The resulting Series is then converted to a list as required
triplets_list = list(triplets_raw.dropna())

Some outputs:

>>> df.groupby('question_id').apply(create_triplet)
question_id
1                                     None
2            [cat, the answer is cat, dog]
3                    [Milan, MILAN, Paris]
4    [The capital is Paris, Paris, London]
5                                     None
6                                     None
>>> triplets_raw = df.groupby('question_id').apply(create_triplet)
>>> list(triplets_raw.dropna())
[['cat', 'the answer is cat', 'dog'], ['Milan', 'MILAN', 'Paris'], ['The capital is Paris', 'Paris', 'London']]
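
One thing to watch for: the dummy df stores is_correct as the strings '1.0' and '0.0', while the code in the question compares against the floats 1.0 and 0.0. If the real df holds floats, the string comparisons above would match nothing. Below is a hedged variant, assuming the column can be cast to float, that also loops over the groups directly instead of going through apply. build_triplets is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

def build_triplets(df):
    """Return one [correct, correct, wrong] list per qualifying question_id."""
    # Assumption: is_correct can be cast to float (works for '1.0' and 1.0)
    work = df.assign(is_correct=df['is_correct'].astype(float))
    triplets = []
    for _, grp in work.groupby('question_id'):
        # Distinct correct answers, in row order
        correct = grp.loc[grp['is_correct'] == 1.0, 'answer'].drop_duplicates()
        wrong = grp.loc[grp['is_correct'] == 0.0, 'answer']
        # Keep the question only with 2 distinct correct and 1 wrong answer
        if len(correct) >= 2 and len(wrong) >= 1:
            triplets.append(list(correct.iloc[:2]) + [wrong.iloc[0]])
    return triplets

# Question 2 and question 6 from the dummy data, with float flags
demo = pd.DataFrame({
    'question_id': ['2', '2', '2', '2', '6', '6'],
    'answer': ['dog', 'cat', 'dog', 'the answer is cat', '5.5', '5.2'],
    'is_correct': [0.0, 1.0, 0.0, 1.0, 1.0, 0.0],
})
print(build_triplets(demo))  # [['cat', 'the answer is cat', 'dog']]
```

Collecting the lists in a plain Python loop also avoids the intermediate Series of lists and None values, so no dropna() step is needed.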
