Removing specific duplicates from a df/list of lists

Question

I have the following pandas df (a dummy df; the original data has about 50,000 rows).

import pandas as pd

columns = ['question_id', 'answer', 'is_correct']
data = [['1', 'hello', '1.0'],
        ['1', 'hello', '1.0'],
        ['1', 'hello', '1.0'],
        ['2', 'dog', '0.0'],
        ['2', 'cat', '1.0'],
        ['2', 'dog', '0.0'],
        ['2', 'the answer is cat', '1.0'],
        ['3', 'Milan', '1.0'],
        ['3', 'Paris', '0.0'],
        ['3', 'The capital is Paris', '0.0'],
        ['3', 'MILAN', '1.0'],
        ['4', 'The capital is Paris', '1.0'],
        ['4', 'London', '0.0'],
        ['4', 'Paris', '1.0'],
        ['4', 'paris', '1.0'],
        ['5', 'lol', '0.0'],
        ['5', 'rofl', '0.0'],
        ['6', '5.5', '1.0'],
        ['6', '5.2', '0.0']]
df = pd.DataFrame(columns=columns, data=data)

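I want to return a list of lists. Each inner list should contain two correct (is_correct = 1.0) answers (a1_correct and a2_correct) and one incorrect (is_correct = 0.0) answer (a_incorrect), all from the same question. Important: if a1_correct equals a2_correct, skip that question; I don't want a1_correct and a2_correct to be duplicates. There should be one inner list per question_id. Any other answers within a question_id can simply be ignored.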

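Edge cases:

All answers are correct -> skip this question
All correct answers are duplicates -> skip this question
No answer is correct -> skip this question, i.e. no output. See question_id = 5
Only one answer is correct -> skip this question, i.e. no output. See question_id = 6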
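
The rules and edge cases above can be restated as a small plain-Python function. This is only a sketch to pin down the intended behaviour: the name `pick_triplet` and the `(answer, is_correct)` pair format are assumptions made for illustration, not part of the original code.

```python
def pick_triplet(answers):
    """Return [a1_correct, a2_correct, a_incorrect] or None.

    `answers` is a list of (answer, is_correct) pairs for ONE question,
    where is_correct is a bool (hypothetical input format).
    """
    correct = []      # distinct correct answers, in order of appearance
    incorrect = None  # first incorrect answer seen
    for answer, is_correct in answers:
        if is_correct:
            if answer not in correct:
                correct.append(answer)
        elif incorrect is None:
            incorrect = answer
    # Need two different correct answers and at least one incorrect one
    if len(correct) >= 2 and incorrect is not None:
        return [correct[0], correct[1], incorrect]
    return None


# question_id = 2 from the dummy data
q2 = [('dog', False), ('cat', True), ('dog', False), ('the answer is cat', True)]
print(pick_triplet(q2))
```

Questions where every answer is correct, where all correct answers are identical, or where fewer than two answers are correct all fall through to `None`.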

I would like the output to look like this:

[['cat', 'the answer is cat', 'dog'], ['Milan', 'MILAN', 'Paris'], ['The capital is Paris', 'paris', 'London']]

My current approach includes duplicates. How can I fix that? Should I remove the duplicates from the df first and then create the list of lists?

import builtins

def create_triplet(grp):
    is_correct = grp['is_correct'] == 1.0
    is_wrong = grp['is_correct'] == 0.0
    if (is_correct.value_counts().get(True, 0) >= 2) and is_wrong.any():
      a1_correct = grp['answer'][is_correct].iloc[0]
      a2_correct = grp['answer'][is_correct].iloc[1]
      #here I tried to ignore duplicates but it doesn't work
      if a1_correct == a2_correct:
        return
      else: grp['answer'][is_correct].iloc[1]
      incorrect = grp['answer'][is_wrong].iloc[0]
      return [a1_correct, a2_correct, incorrect]

triplets_raw = df.groupby('question_id').apply(create_triplet)
triplets_list = list(builtins.filter(lambda x: (x is not None), triplets_raw.to_list()))

Answer 1

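Since you don't want any duplicates among the correct answers, call drop_duplicates() on the correct answers before picking 2 of them. Any 2 answers chosen from the result are then guaranteed to be distinct. After that, select (up to) 2 correct answers, and do the same for the wrong answer(s).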

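Once the correct and wrong answers are selected, and if I understood correctly, create_triplet should only return something when there are 2 correct answers and 1 wrong answer to return. A len() check, for example, handles this well.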
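
As a quick illustration of these two steps, here is a minimal sketch using only the correct answers of question_id 1 and question_id 4 from the dummy data. The variable names are made up for this example.

```python
import pandas as pd

# Correct answers for question_id 1 and question_id 4 in the dummy data
q1_correct = pd.Series(['hello', 'hello', 'hello'])
q4_correct = pd.Series(['The capital is Paris', 'Paris', 'paris'])

# Drop duplicates first, then take up to 2 of what is left
q1_picked = list(q1_correct.drop_duplicates().iloc[:2])
q4_picked = list(q4_correct.drop_duplicates().iloc[:2])

print(q1_picked, len(q1_picked))  # ['hello'] 1
print(q4_picked, len(q4_picked))  # ['The capital is Paris', 'Paris'] 2
```

Question 1 ends up with a single correct answer, so a `len(...) == 2` check rejects it. Note that drop_duplicates() is case-sensitive, so 'Paris' and 'paris' count as different answers.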

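I modified the code you provided slightly, and it generates the expected output, except that for question_id 4 it picks 'Paris' rather than 'paris' as the second correct answer.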

There are also some comments in the code, and some sample output after it, to clarify what the code does.

import pandas as pd

def create_triplet(grp):
    # Select unique, correct answers
    correct = grp.loc[grp['is_correct'] == '1.0', 'answer'].drop_duplicates()
    # Select up to 2 correct answers and change to a list
    correct = list(correct.iloc[:2])
    # Repeat similarly for wrong answers, except only take up to 1 wrong answer
    # The same thing in one line
    # May or may not be easier to read, use whichever you prefer
    # Note: drop_duplicates is not necessary here
    wrong = list(grp.loc[grp['is_correct'] == '0.0', 'answer'].drop_duplicates().iloc[:1])
    # Question should not be skipped when there are (at least)
    # 2 different but correct answers and 1 wrong answer
    if len(correct) == 2 and len(wrong) == 1:
        return correct + wrong
    # Otherwise signify skipping the question by returning None
    return None


columns = ['question_id', 'answer', 'is_correct']
data = [
    ['1', 'hello', '1.0'],
    ['1', 'hello', '1.0'],
    ['1', 'hello', '1.0'],
    ['2', 'dog', '0.0'],
    ['2', 'cat', '1.0'],
    ['2', 'dog', '0.0'],
    ['2', 'the answer is cat', '1.0'],
    ['3', 'Milan', '1.0'],
    ['3', 'Paris', '0.0'],
    ['3', 'The capital is Paris', '0.0'],
    ['3', 'MILAN', '1.0'],
    ['4', 'The capital is Paris', '1.0'],
    ['4', 'London', '0.0'],
    ['4', 'Paris', '1.0'],
    ['4', 'paris', '1.0'],
    ['5', 'lol', '0.0'],
    ['5', 'rofl', '0.0'],
    ['6', '5.5', '1.0'],
    ['6', '5.2', '0.0']
]
df = pd.DataFrame(columns=columns, data=data)
expected = [
    ['cat', 'the answer is cat', 'dog'],
    ['Milan', 'MILAN', 'Paris'],
    ['The capital is Paris', 'paris', 'London']
]

triplets_raw = df.groupby('question_id').apply(create_triplet)
# Triplets_raw is a pandas Series with values being either
# a list of valid responses or None
# dropna() removes rows with None-values, leaving only rows with lists
# The resulting Series is then changed to list as required
triplets_list = list(triplets_raw.dropna())
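
Note that the function above compares is_correct against the strings '1.0' and '0.0', because the dummy df stores that column as strings. Comparing the same column against the float 1.0 matches nothing. The sketch below shows the difference; the `astype(float)` conversion is only a suggestion in case the real data should hold floats.

```python
import pandas as pd

# Same kind of values as the is_correct column in the dummy df (strings)
is_correct = pd.Series(['1.0', '0.0', '1.0'], dtype=object)

matches_float = is_correct == 1.0     # string vs float: never equal
matches_string = is_correct == '1.0'  # string vs string: works

# Optional: convert once, then compare against floats
as_float = is_correct.astype(float)

print(matches_float.sum())       # 0
print(matches_string.sum())      # 2
print((as_float == 1.0).sum())   # 2
```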

Some output:

>>> df.groupby('question_id').apply(create_triplet)
question_id
1                                     None
2            [cat, the answer is cat, dog]
3                    [Milan, MILAN, Paris]
4    [The capital is Paris, Paris, London]
5                                     None
6                                     None
>>> triplets_raw = df.groupby('question_id').apply(create_triplet)
>>> list(triplets_raw.dropna())
[['cat', 'the answer is cat', 'dog'], ['Milan', 'MILAN', 'Paris'], ['The capital is Paris', 'Paris', 'London']]
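
The question also asks whether duplicates could be removed from the df first. A minimal sketch of that variant follows, using a shortened copy of the dummy data. It uses a plain loop over the groups instead of apply(); this is one possible arrangement, not the approach used in the answer above.

```python
import pandas as pd

columns = ['question_id', 'answer', 'is_correct']
data = [
    ['1', 'hello', '1.0'],
    ['1', 'hello', '1.0'],
    ['1', 'hello', '1.0'],
    ['2', 'dog', '0.0'],
    ['2', 'cat', '1.0'],
    ['2', 'dog', '0.0'],
    ['2', 'the answer is cat', '1.0'],
    ['6', '5.5', '1.0'],
    ['6', '5.2', '0.0'],
]
df = pd.DataFrame(data, columns=columns)

# Remove rows that repeat the same answer within the same question
deduped = df.drop_duplicates(subset=['question_id', 'answer', 'is_correct'])

triplets_list = []
for _, grp in deduped.groupby('question_id'):
    correct = grp.loc[grp['is_correct'] == '1.0', 'answer'].tolist()
    wrong = grp.loc[grp['is_correct'] == '0.0', 'answer'].tolist()
    # Keep the question only with 2 distinct correct answers and 1 wrong one
    if len(correct) >= 2 and wrong:
        triplets_list.append(correct[:2] + wrong[:1])

print(triplets_list)  # [['cat', 'the answer is cat', 'dog']]
```

Because the duplicates are gone before grouping, the per-group logic no longer needs its own drop_duplicates() call.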
