
Remove specific duplicates from df/list of lists

I have the following pandas df (a dummy df; the original data has about 50,000 rows).

import pandas as pd

columns = ['question_id', 'answer', 'is_correct']
data = [['1', 'hello', '1.0'],
        ['1', 'hello', '1.0'],
        ['1', 'hello', '1.0'],
        ['2', 'dog', '0.0'],
        ['2', 'cat', '1.0'],
        ['2', 'dog', '0.0'],
        ['2', 'the answer is cat', '1.0'],
        ['3', 'Milan', '1.0'],
        ['3', 'Paris', '0.0'],
        ['3', 'The capital is Paris', '0.0'],
        ['3', 'MILAN', '1.0'],
        ['4', 'The capital is Paris', '1.0'],
        ['4', 'London', '0.0'],
        ['4', 'Paris', '1.0'],
        ['4', 'paris', '1.0'],
        ['5', 'lol', '0.0'],
        ['5', 'rofl', '0.0'],
        ['6', '5.5', '1.0'],
        ['6', '5.2', '0.0']]
df = pd.DataFrame(columns=columns, data=data)

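I want to return a list of lists. Each inner list should contain two correct answers (is_correct = 1.0) from the same question (a1_correct and a2_correct) and one incorrect answer (is_correct = 0.0) (a_incorrect). Important: if a1_correct equals a2_correct, skip that question; I don't want a1_correct and a2_correct to be duplicates. There should be one inner list per question_id. The other answers within a question_id can simply be ignored.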

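Edge cases:

  • All answers are correct -> skip this question
  • All correct answers are duplicates -> skip this question
  • No answer is correct -> skip this question, i.e. no output. See question_id = 5
  • Only one answer is correct -> skip this question, i.e. no output. See question_id = 6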

I would like the output to look like:

[['cat', 'the answer is cat', 'dog'], ['Milan', 'MILAN', 'Paris'], ['The capital is Paris', 'paris', 'London']]

My current approach includes duplicates. How can I fix that? Should I remove the duplicates from the df first and then create the list of lists?

import builtins

def create_triplet(grp):
    # is_correct is stored as a string in the dummy df above, so compare against strings
    is_correct = grp['is_correct'] == '1.0'
    is_wrong = grp['is_correct'] == '0.0'
    if (is_correct.value_counts().get(True, 0) >= 2) and is_wrong.any():
      a1_correct = grp['answer'][is_correct].iloc[0]
      a2_correct = grp['answer'][is_correct].iloc[1]
      #here I tried to ignore duplicates but it doesn't work
      if a1_correct == a2_correct:
        return
      else: grp['answer'][is_correct].iloc[1]
      incorrect = grp['answer'][is_wrong].iloc[0]
      return [a1_correct, a2_correct, incorrect]

triplets_raw = df.groupby('question_id').apply(create_triplet)
triplets_list = list(builtins.filter(lambda x: (x is not None), triplets_raw.to_list()))

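Since you don't want any duplicates among the correct answers, call drop_duplicates() on the correct answers before selecting 2 of them. Any 2 answers selected from the result are then guaranteed to be different. After that, select (up to) 2 of the correct answers, and do the same for the wrong answers, taking up to 1.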
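A minimal sketch of what drop_duplicates() does to a column of answers, using values from the example data. The variable names are only for illustration. Note that the comparison is exact, so answers that differ only by case count as different, which matches the expected output for question_id = 3.

```python
import pandas as pd

# question_id = 1: every correct answer is the same string
answers = pd.Series(['hello', 'hello', 'hello'])
unique_answers = answers.drop_duplicates()      # keeps only the first occurrence
first_two = unique_answers.iloc[:2].tolist()    # slicing does not raise if fewer than 2 remain
print(first_two)                                # ['hello']

# Answers that differ only by case are both kept
cased = pd.Series(['Milan', 'MILAN', 'Milan'])
cased_unique = cased.drop_duplicates().tolist()
print(cased_unique)                             # ['Milan', 'MILAN']
```

Because only one unique answer is left for question_id = 1, a later length check can detect that the question has to be skipped.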

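Once the correct and wrong answers have been selected, create_triplet should, if I understood correctly, only return something when there are 2 correct answers and 1 wrong answer to return. Checking the lengths with len() works well for that.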

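I modified the code you provided slightly. It produces the expected output, apart from question_id = 4, where it returns 'Paris' rather than 'paris' because it keeps the first two distinct correct answers.

The code contains some comments, and the sample output after it shows what the code does.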

import pandas as pd

def create_triplet(grp):
    # Select unique, correct answers
    correct = grp.loc[grp['is_correct'] == '1.0', 'answer'].drop_duplicates()
    # Select up to 2 correct answers and change to a list
    correct = list(correct.iloc[:2])
    # Do the same for the wrong answers, except take at most 1 wrong answer
    # This is the same selection written as one line;
    # it may or may not be easier to read, use whichever you prefer
    # Note: drop_duplicates is not necessary here
    wrong = list(grp.loc[grp['is_correct'] == '0.0', 'answer'].drop_duplicates().iloc[:1])
    # Question should not be skipped when there are (at least)
    # 2 different but correct answers and 1 wrong answer
    if len(correct) == 2 and len(wrong) == 1:
        return correct + wrong
    # Otherwise signify skipping the question by returning None
    return None


columns = ['question_id', 'answer', 'is_correct']
data = [
    ['1', 'hello', '1.0'],
    ['1', 'hello', '1.0'],
    ['1', 'hello', '1.0'],
    ['2', 'dog', '0.0'],
    ['2', 'cat', '1.0'],
    ['2', 'dog', '0.0'],
    ['2', 'the answer is cat', '1.0'],
    ['3', 'Milan', '1.0'],
    ['3', 'Paris', '0.0'],
    ['3', 'The capital is Paris', '0.0'],
    ['3', 'MILAN', '1.0'],
    ['4', 'The capital is Paris', '1.0'],
    ['4', 'London', '0.0'],
    ['4', 'Paris', '1.0'],
    ['4', 'paris', '1.0'],
    ['5', 'lol', '0.0'],
    ['5', 'rofl', '0.0'],
    ['6', '5.5', '1.0'],
    ['6', '5.2', '0.0']
]
df = pd.DataFrame(columns=columns, data=data)
expected = [
    ['cat', 'the answer is cat', 'dog'],
    ['Milan', 'MILAN', 'Paris'],
    ['The capital is Paris', 'paris', 'London']
]

triplets_raw = df.groupby('question_id').apply(create_triplet)
# Triplets_raw is a pandas Series with values being either
# a list of valid responses or None
# dropna() removes rows with None-values, leaving only rows with lists
# The resulting Series is then changed to list as required
triplets_list = list(triplets_raw.dropna())

Some output:

>>> df.groupby('question_id').apply(create_triplet)
question_id
1                                     None
2            [cat, the answer is cat, dog]
3                    [Milan, MILAN, Paris]
4    [The capital is Paris, Paris, London]
5                                     None
6                                     None
>>> triplets_raw = df.groupby('question_id').apply(create_triplet)
>>> list(triplets_raw.dropna())
[['cat', 'the answer is cat', 'dog'], ['Milan', 'MILAN', 'Paris'], ['The capital is Paris', 'Paris', 'London']]
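The same selection can also be sketched by looping over the groups directly, which avoids the None placeholders and the dropna() step. This is an alternative sketch rather than the code above: build_triplets is a hypothetical name, the data is a shortened copy of the example, and it assumes is_correct is stored as the strings '1.0' and '0.0'.

```python
import pandas as pd

def build_triplets(df):
    """Return [correct_1, correct_2, wrong] per question, skipping invalid questions."""
    triplets = []
    # sort=False keeps the questions in order of first appearance
    for _, grp in df.groupby('question_id', sort=False):
        correct = grp.loc[grp['is_correct'] == '1.0', 'answer'].drop_duplicates().tolist()
        wrong = grp.loc[grp['is_correct'] == '0.0', 'answer'].tolist()
        # Keep the question only if it has 2 distinct correct answers and 1 wrong answer
        if len(correct) >= 2 and wrong:
            triplets.append(correct[:2] + wrong[:1])
    return triplets

df = pd.DataFrame(
    columns=['question_id', 'answer', 'is_correct'],
    data=[
        ['1', 'hello', '1.0'], ['1', 'hello', '1.0'],
        ['2', 'dog', '0.0'], ['2', 'cat', '1.0'], ['2', 'the answer is cat', '1.0'],
        ['5', 'lol', '0.0'], ['5', 'rofl', '0.0'],
        ['6', '5.5', '1.0'], ['6', '5.2', '0.0'],
    ],
)
triplets = build_triplets(df)
print(triplets)  # [['cat', 'the answer is cat', 'dog']]
```

Only question_id = 2 survives here: question 1 has a single unique correct answer, question 5 has no correct answer, and question 6 has only one.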
