Remove specific duplicates from df/list of lists
I have the following pandas df (a dummy df; the original data has around 50,000 rows).
import pandas as pd

columns = ['question_id', 'answer', 'is_correct']
data = [['1', 'hello', '1.0'],
        ['1', 'hello', '1.0'],
        ['1', 'hello', '1.0'],
        ['2', 'dog', '0.0'],
        ['2', 'cat', '1.0'],
        ['2', 'dog', '0.0'],
        ['2', 'the answer is cat', '1.0'],
        ['3', 'Milan', '1.0'],
        ['3', 'Paris', '0.0'],
        ['3', 'The capital is Paris', '0.0'],
        ['3', 'MILAN', '1.0'],
        ['4', 'The capital is Paris', '1.0'],
        ['4', 'London', '0.0'],
        ['4', 'Paris', '1.0'],
        ['4', 'paris', '1.0'],
        ['5', 'lol', '0.0'],
        ['5', 'rofl', '0.0'],
        ['6', '5.5', '1.0'],
        ['6', '5.2', '0.0']]
df = pd.DataFrame(columns=columns, data=data)
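I want to return a list of lists. Each inner list should contain two correct (is_correct = 1.0) answers (a1_correct and a2_correct) and one incorrect (is_correct = 0.0) answer (a_incorrect) from the same question. Important: if a1_correct equals a2_correct, skip that question; I don't want duplicates between a1_correct and a2_correct. There should be one inner list per question_id. Any other answers within a question_id can simply be ignored.
Edge cases:
I would like the output to look like: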
[['cat', 'the answer is cat', 'dog'], ['Milan', 'MILAN', 'Paris'], ['The capital is Paris', 'paris', 'London']]
My current approach includes duplicates. How can I fix that? Should I first remove the duplicates from the df and then create the list of lists?
import builtins
def create_triplet(grp):
    is_correct = grp['is_correct'] == 1.0
    is_wrong = grp['is_correct'] == 0.0
    if (is_correct.value_counts().get(True, 0) >= 2) and is_wrong.any():
        a1_correct = grp['answer'][is_correct].iloc[0]
        a2_correct = grp['answer'][is_correct].iloc[1]
        #here I tried to ignore duplicates but it doesn't work
        if a1_correct == a2_correct:
            return
        else: grp['answer'][is_correct].iloc[1]
        incorrect = grp['answer'][is_wrong].iloc[0]
        return [a1_correct, a2_correct, incorrect]
triplets_raw = df.groupby('question_id').apply(create_triplet)
triplets_list = list(builtins.filter(lambda x: (x is not None), triplets_raw.to_list()))
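Since you don't want any duplicates among the correct answers, call drop_duplicates() on the correct answers before picking two of them. Any two answers picked from the result are then guaranteed to be distinct. After that, select (up to) 2 of those answers in whatever way you like, and do the same for the wrong answers.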
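Once the correct and wrong answers have been selected, create_triplet should, if I understand correctly, only return something when there are 2 correct answers and 1 wrong answer. Checking the lengths with len(), for example, handles this nicely.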
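To see the two steps described above in isolation, here is a minimal sketch on standalone Series. The sample values are made up for the demonstration and are not taken from the real data.

```python
import pandas as pd

# Hypothetical correct answers for one question, with a repeated value
correct = pd.Series(['Paris', 'Paris', 'The capital is Paris', 'paris'])

# drop_duplicates() keeps the first occurrence of each value, in order
unique_correct = correct.drop_duplicates()

# iloc[:2] takes at most two values and does not fail on a shorter Series
first_two = list(unique_correct.iloc[:2])
print(first_two)  # ['Paris', 'The capital is Paris']

# With only one distinct value the list has length 1,
# so a len(...) == 2 check would skip the question
only_one = list(pd.Series(['hello', 'hello', 'hello']).drop_duplicates().iloc[:2])
print(only_one)  # ['hello']
```

Note that the comparison is case-sensitive, so 'Paris' and 'paris' count as different answers.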
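I modified the code you provided slightly, and it generates the expected output, with one difference: for question 4 it picks 'Paris' rather than 'paris' as the second correct answer, because it takes the first two distinct correct answers.
The code contains some comments, and the sample output after the code clarifies what it does.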
import pandas as pd
def create_triplet(grp):
    # Select unique, correct answers
    correct = grp.loc[grp['is_correct'] == '1.0', 'answer'].drop_duplicates()
    # Select up to 2 correct answers and change to a list
    correct = list(correct.iloc[:2])
    # Do the same for the wrong answers, except take only up to 1 wrong answer.
    # This is the same thing in one line;
    # it may or may not be easier to read, use whichever you prefer.
    # Note: drop_duplicates is not necessary here
    wrong = list(grp.loc[grp['is_correct'] == '0.0', 'answer'].drop_duplicates().iloc[:1])
    # Question should not be skipped when there are (at least)
    # 2 different but correct answers and 1 wrong answer
    if len(correct) == 2 and len(wrong) == 1:
        return correct + wrong
    # Otherwise signify skipping the question by returning None
    return None
columns = ['question_id', 'answer', 'is_correct']
data = [
    ['1', 'hello', '1.0'],
    ['1', 'hello', '1.0'],
    ['1', 'hello', '1.0'],
    ['2', 'dog', '0.0'],
    ['2', 'cat', '1.0'],
    ['2', 'dog', '0.0'],
    ['2', 'the answer is cat', '1.0'],
    ['3', 'Milan', '1.0'],
    ['3', 'Paris', '0.0'],
    ['3', 'The capital is Paris', '0.0'],
    ['3', 'MILAN', '1.0'],
    ['4', 'The capital is Paris', '1.0'],
    ['4', 'London', '0.0'],
    ['4', 'Paris', '1.0'],
    ['4', 'paris', '1.0'],
    ['5', 'lol', '0.0'],
    ['5', 'rofl', '0.0'],
    ['6', '5.5', '1.0'],
    ['6', '5.2', '0.0']
]
df = pd.DataFrame(columns=columns, data=data)
expected = [
    ['cat', 'the answer is cat', 'dog'],
    ['Milan', 'MILAN', 'Paris'],
    ['The capital is Paris', 'paris', 'London']
]
triplets_raw = df.groupby('question_id').apply(create_triplet)
# triplets_raw is a pandas Series whose values are either
# a list of valid responses or None.
# dropna() removes rows with None values, leaving only rows with lists.
# The resulting Series is then converted to a list, as required.
triplets_list = list(triplets_raw.dropna())
Some output:
>>> df.groupby('question_id').apply(create_triplet)
question_id
1                                     None
2            [cat, the answer is cat, dog]
3                    [Milan, MILAN, Paris]
4    [The capital is Paris, Paris, London]
5                                     None
6                                     None
>>> triplets_raw = df.groupby('question_id').apply(create_triplet)
>>> list(triplets_raw.dropna())
[['cat', 'the answer is cat', 'dog'], ['Milan', 'MILAN', 'Paris'], ['The capital is Paris', 'Paris', 'London']]
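As for the question of whether to remove duplicates from the df first: that also works. Below is a minimal sketch of that variant, not a drop-in replacement. It uses a shortened copy of the dummy data, drops exact duplicate rows before grouping, and loops over the groups instead of using apply. The astype(float) step is an assumption: it is only there so the comparison works whether is_correct holds strings such as '1.0' (as in the dummy df) or floats (as the original code in the question suggests).

```python
import pandas as pd

columns = ['question_id', 'answer', 'is_correct']
# Shortened copy of the dummy data (questions 1, 2 and 6 only)
data = [
    ['1', 'hello', '1.0'], ['1', 'hello', '1.0'],
    ['2', 'dog', '0.0'], ['2', 'cat', '1.0'],
    ['2', 'dog', '0.0'], ['2', 'the answer is cat', '1.0'],
    ['6', '5.5', '1.0'], ['6', '5.2', '0.0'],
]
df = pd.DataFrame(columns=columns, data=data)

# Assumption: normalise the flag to float so that the comparison
# works for both string ('1.0') and float (1.0) columns
df['is_correct'] = df['is_correct'].astype(float)

# Remove rows that are exact duplicates before grouping
deduped = df.drop_duplicates()

triplets_list = []
for _, grp in deduped.groupby('question_id'):
    # Up to 2 correct answers and up to 1 wrong answer per question
    correct = grp.loc[grp['is_correct'] == 1.0, 'answer'].tolist()[:2]
    wrong = grp.loc[grp['is_correct'] == 0.0, 'answer'].tolist()[:1]
    # Keep the question only if it has 2 distinct correct answers and 1 wrong one
    if len(correct) == 2 and len(wrong) == 1:
        triplets_list.append(correct + wrong)

print(triplets_list)  # [['cat', 'the answer is cat', 'dog']]
```

Both variants select the same answers; deduplicating up front mainly helps if the cleaned df is needed for other steps as well.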