KeyError: Int64Index([1], dtype='int64') when using drop_duplicates

Question

I wrote a simple script that is supposed to merge (union) some dataframes and remove the duplicates.

For example, for the input:

df_A:
a  1
b  2

df_B:
b  2
c  3

The expected output would be:

df_out:
a  1
b  2
c  3

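I wrote the following code:

import csv

import pandas as pd


def read_dataframes(filenames, basedir):
    # Read each tab-separated file without a header row, so columns are labeled 0, 1, 2, ...
    return [pd.read_csv(basedir + file, sep='\t', header=None, quoting=csv.QUOTE_NONE) for file in filenames]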


def merge_dataframes(dfs, out):
    merged = pd.concat(dfs).drop_duplicates(subset=[0, 1]).reset_index(drop=True)
    merged = merged.iloc[:, [0, 1, 2, 7, 8, 9]]
    merged.to_csv(out, header=None, index=None, sep='\t')

I call these functions as follows:

merge_dataframes(read_dataframes(filenames, basedir), output)

I am getting a KeyError exception:

Traceback (most recent call last):
  File "analysis_and_visualization.py", line 70, in <module>
    merge_dataframes(read_dataframes(wild_emb, wild_basedir), 'wild_emb_merged')
  File "analysis_and_visualization.py", line 17, in merge_dataframes
    merged = pd.concat(dfs).drop_duplicates(subset=[0, 1]).reset_index(drop=True)
  File "/Data/user/eliran/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 5112, in drop_duplicates
    duplicated = self.duplicated(subset, keep=keep)
  File "/Data/user/eliran/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 5248, in duplicated
    raise KeyError(diff)
KeyError: Int64Index([1], dtype='int64')

What am I doing wrong?

Answer 1

Going over the source code in frame.py, specifically the function duplicated, it seems that not all of the columns you passed in subset exist in your dataframe.

From `class DataFrame`, `def duplicated(self)` (snippet):

        # Verify all columns in subset exist in the queried dataframe
        # Otherwise, raise a KeyError, same as if you try to __getitem__ with a
        # key that doesn't exist.
        diff = Index(subset).difference(self.columns)
        if not diff.empty:
            raise KeyError(diff)

For example, asking for a column that does not exist reproduces the error:

df = pd.DataFrame({'col1': [0, 1, 2], 'col3': [1, 2, 3]})

print(df)

   col1  col3
0     0     1
1     1     2
2     2     3


df.drop_duplicates(subset=['col1','col2'])

   5246         diff = Index(subset).difference(self.columns)
   5247         if not diff.empty:
-> 5248             raise KeyError(diff)
   5249 
   5250         vals = (col.values for name, col in self.items() if name in subset)

KeyError: Index(['col2'], dtype='object')
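The same check can be run up front, before calling drop_duplicates, to see which requested columns are absent. This is only a minimal sketch: missing_subset_columns is a hypothetical helper name, not part of pandas, and it simply mirrors the Index.difference check quoted above.

```python
import pandas as pd


def missing_subset_columns(df, subset):
    """Return the labels in `subset` that are not columns of `df`.

    Hypothetical helper (not a pandas API); it repeats the
    Index(subset).difference(df.columns) check from DataFrame.duplicated.
    """
    return list(pd.Index(subset).difference(df.columns))


df = pd.DataFrame({'col1': [0, 1, 2], 'col3': [1, 2, 3]})

# 'col2' is not a column of df, so it is reported as missing.
missing = missing_subset_columns(df, ['col1', 'col2'])
print(missing)

# Only deduplicate when every requested column is present.
if not missing:
    df = df.drop_duplicates(subset=['col1', 'col2'])
```

An empty result means drop_duplicates should not raise this particular KeyError for that subset.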

Answer 2

I think the problem here is not column 1 itself. The first column is being converted to the index, so some or all of the DataFrames have only one column, called 0.

To prevent this, use the index_col=False parameter in read_csv:

def read_dataframes(filenames, basedir):
    return [pd.read_csv(basedir + file, sep='\t', header=None, quoting=csv.QUOTE_NONE, index_col=False) for file in filenames]

Another possible cause is that, for some reason, the files contain only one column of data, so the second column, called 1, does not exist.
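One way to tell which case applies is to look at the shape and column labels of each frame before concatenating. The sketch below is an assumption-laden illustration: the io.StringIO objects are hypothetical stand-ins for the real tab-separated files, and the single-column "file" is invented to show what a narrow frame looks like.

```python
import csv
import io

import pandas as pd

# Hypothetical in-memory stand-ins for the tab-separated input files.
sources = [
    io.StringIO("a\t1\nb\t2\n"),  # two columns
    io.StringIO("b\t2\nc\t3\n"),  # two columns
    io.StringIO("d\ne\n"),        # only one column of data
]

dfs = [
    pd.read_csv(src, sep='\t', header=None, quoting=csv.QUOTE_NONE)
    for src in sources
]

# Report the shape of each frame and which of the labels 0 and 1 it lacks.
shapes = []
for i, df in enumerate(dfs):
    lacking = [label for label in (0, 1) if label not in df.columns]
    shapes.append(df.shape)
    print(i, df.shape, lacking)

# Frames that have both columns merge and deduplicate as expected.
merged = (
    pd.concat(dfs[:2])
    .drop_duplicates(subset=[0, 1])
    .reset_index(drop=True)
)
print(merged)
```

Any frame whose shape shows a single column is worth opening by hand to check its separator and contents.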
