Pandas - KeyError - Dropping rows by index in a nested loop

Question

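I have a Pandas data frame (df in the code below). I am trying to use a nested for loop to iterate over each tuple (row) of the data frame and, on each iteration, compare that tuple with all the other tuples in the frame. In the comparison step I use Python's difflib.SequenceMatcher().ratio() and drop the tuples with a high similarity (ratio > 0.8).

Problem: Unfortunately, I get a KeyError after the first outer-loop iteration.

I suspect that by dropping tuples I am invalidating the outer loop's indexer. Alternatively, I am invalidating the inner loop's indexer by trying to access an element that no longer exists (because it was dropped).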
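The suspicion above can be checked on a small scale. The following is a minimal sketch on a hypothetical three-row frame (not the tweet data): it drops one row in place and then looks up the dropped label the same way the loop does, with df['text'][label].

```python
import pandas as pd

# Hypothetical three-row frame standing in for the real data.
df = pd.DataFrame({"text": ["a", "b", "c"]})

df.drop(1, inplace=True)      # remove the row labelled 1; labels 0 and 2 remain

try:
    df["text"][1]             # label-based lookup of a label that was dropped
    raised = False
except KeyError:
    raised = True

print(raised)                 # True if the lookup raised KeyError
print(list(df.index))         # remaining labels
```

If this matches what happens in the real loop, any later iteration that reaches a label dropped earlier would fail in the same way.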

Here is my code:

import json
import pandas as pd
import pyreadline
import pprint
from difflib import SequenceMatcher

# Note, this file, 'tweetsR.json', was originally csv, but has been translated to json.

with open("twitter data/tweetsR.json", "r") as read_file:
    data = json.load(read_file)  # Load the source data set, esport tweets.

df = pd.DataFrame(data) # Load data into a pandas(pd) data frame for pandas utilities.
# Drop tweets with identical text content. Note, these tweets are likely reposts/retweets, etc.
df = df.drop_duplicates(['text'], keep='first')
df = df.reset_index(drop=True) # Adjust the index to reflect dropping of duplicates.

def duplicates(df):
    for ind in df.index:
        a = df['text'][ind]
        for indd in df.index:
            if indd != 26747: # Trying to prevent an overstep keyError here
                b = df['text'][indd+1]
                if similar(a,b) >= 0.80:
                    df.drop((indd+1), inplace=True)
        print(str(ind) + "Completed") # Debugging statement, tells us which iterations have completed

duplicates(df)

Error output:

Can anyone help me understand and/or fix this?

Answer 1

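One solution, mentioned by @KazuyaHatta, is itertools.combinations(). However, the way I am using it (there may be another way) is O(n^2). So in this case, with about 27,000 tuples, it has to iterate over 357,714,378 combinations (which takes too long).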
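As a quick check of that figure, here is a small sketch. The row count of 26,748 is an assumption, inferred from the last index label 26747 that appears in the question's code.

```python
import math

# Assumption: 26,748 rows, inferred from the last index label (26747) in the question.
n_rows = 26748

# Number of unordered pairs, n * (n - 1) / 2.
pairs = math.comb(n_rows, 2)
print(pairs)  # 357714378
```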

Here is the code:

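import itertools
from difflib import SequenceMatcher

def similar(a, b):
    # Assumed helper (not shown in the original post): similarity ratio in [0, 1].
    return SequenceMatcher(None, a, b).ratio()

# Create a set of the dropped tuples and run this code on bizon overnight.
def duplicates(df):
    # Find out how to improve the speed of this
    excludes = set()  # index labels of rows judged to be near-duplicates
    for i, j in itertools.combinations(df.index, 2):
        if i in excludes or j in excludes:
            continue  # skip pairs that involve an already excluded row
        if similar(df['text'][i], df['text'][j]) > 0.8:
            excludes.add(j)  # keep the earlier row, mark the later one
            print("Dropped: " + str((i, j)))
            print(len(excludes))
    return df.drop(index=list(excludes))  # drop once, after the loop has finished

df = duplicates(df)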
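On the speed question, one thing that might help is that SequenceMatcher also provides quick_ratio() and real_quick_ratio(), which the difflib documentation describes as upper bounds on ratio(). The sketch below uses them as cheap pre-filters; similar_above is a hypothetical helper name, and I have not measured how much time this saves on the tweet data.

```python
from difflib import SequenceMatcher

def similar_above(a, b, threshold=0.8):
    """Return True if ratio() exceeds threshold, trying the cheap upper bounds first."""
    m = SequenceMatcher(None, a, b)
    # Both quick checks are upper bounds on ratio(): if either one is at or below
    # the threshold, ratio() cannot exceed it, so the expensive call is skipped.
    return (m.real_quick_ratio() > threshold
            and m.quick_ratio() > threshold
            and m.ratio() > threshold)

print(similar_above("hello world", "hello world!"))
print(similar_above("abc", "xyz"))
```

This does not change the O(n^2) number of pairs; it only aims to make each comparison cheaper on average.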

My next step, as described by @KazuyaHatta, is to try the drop-by-mask approach.
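A rough sketch of what dropping by mask could look like, assuming a 'text' column as in the question. The function name drop_near_duplicates, the sample frame, and the threshold parameter are hypothetical. Rows are only flagged during the loops, and the frame is filtered once at the end, so no label disappears while iterating.

```python
from difflib import SequenceMatcher

import pandas as pd

def similar(a, b):
    # Assumed helper: similarity ratio in [0, 1].
    return SequenceMatcher(None, a, b).ratio()

def drop_near_duplicates(df, threshold=0.8):
    texts = df["text"].tolist()
    keep = [True] * len(texts)            # positional flags, one per row
    for i in range(len(texts)):
        if not keep[i]:
            continue                      # row already flagged as a near-duplicate
        for j in range(i + 1, len(texts)):
            if keep[j] and similar(texts[i], texts[j]) > threshold:
                keep[j] = False           # flag only; nothing is removed yet
    mask = pd.Series(keep, index=df.index)
    return df[mask].reset_index(drop=True)

# Hypothetical sample data, not from the real data set.
sample = pd.DataFrame({"text": ["hello world", "hello world!", "something else"]})
result = drop_near_duplicates(sample)
print(result["text"].tolist())
```

This is still quadratic in the number of rows; it only avoids the KeyError by not mutating the frame inside the loops.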

Note: Unfortunately, I am not able to post a sample of the data set.

Solution 1
2 Accepted 2019-12-20 06:48:39