KeyError: Int64Index([1], dtype='int64') when using drop_duplicates
I wrote a simple script that is supposed to merge (union) some dataframes and remove the duplicates.
For example, for the input:
df_A:
a 1
b 2
df_B:
b 2
c 3
The expected output would be:
df_out:
a 1
b 2
c 3
I wrote the following code:
import csv
import pandas as pd

def read_dataframes(filenames, basedir):
    # Read every tab-separated file without treating any row as a header.
    return [pd.read_csv(basedir + file, sep='\t', header=None, quoting=csv.QUOTE_NONE) for file in filenames]

def merge_dataframes(dfs, out):
    # Stack the frames, drop rows duplicated on columns 0 and 1, keep selected columns.
    merged = pd.concat(dfs).drop_duplicates(subset=[0, 1]).reset_index(drop=True)
    merged = merged.iloc[:, [0, 1, 2, 7, 8, 9]]
    merged.to_csv(out, header=None, index=None, sep='\t')
and I am calling these functions in the following manner:
merge_dataframes(read_dataframes(filenames, basedir), output)
I am getting a KeyError exception:
Traceback (most recent call last):
  File "analysis_and_visualization.py", line 70, in <module>
    merge_dataframes(read_dataframes(wild_emb, wild_basedir), 'wild_emb_merged')
  File "analysis_and_visualization.py", line 17, in merge_dataframes
    merged = pd.concat(dfs).drop_duplicates(subset=[0, 1]).reset_index(drop=True)
  File "/Data/user/eliran/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 5112, in drop_duplicates
    duplicated = self.duplicated(subset, keep=keep)
  File "/Data/user/eliran/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 5248, in duplicated
    raise KeyError(diff)
KeyError: Int64Index([1], dtype='int64')
What am I doing wrong?
Going over the source code of the duplicated function in frame.py, it seems that not all of the columns named in subset exist in your dataframe.
Snippet from DataFrame.duplicated in frame.py:
    # Verify all columns in subset exist in the queried dataframe
    # Otherwise, raise a KeyError, same as if you try to __getitem__ with a
    # key that doesn't exist.
    diff = Index(subset).difference(self.columns)
    if not diff.empty:
        raise KeyError(diff)
For example, asking for a column that does not exist reproduces the error:
df = pd.DataFrame({'col1' : [0,1,2], 'col3' : [1,2,3]})
print(df)
   col1  col3
0     0     1
1     1     2
2     2     3
df.drop_duplicates(subset=['col1','col2'])
   5246         diff = Index(subset).difference(self.columns)
   5247         if not diff.empty:
-> 5248             raise KeyError(diff)
   5249
   5250         vals = (col.values for name, col in self.items() if name in subset)
KeyError: Index(['col2'], dtype='object')
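The same check can be run by hand before calling drop_duplicates, to see which labels are missing. This is only a minimal sketch: the helper name missing_subset_columns is made up for illustration and is not part of pandas.

```python
import pandas as pd

def missing_subset_columns(df, subset):
    """Return the labels in `subset` that are not columns of `df`.

    Hypothetical helper; it mirrors the Index.difference check that
    DataFrame.duplicated performs before raising KeyError.
    """
    return list(pd.Index(subset).difference(df.columns))

df = pd.DataFrame({'col1': [0, 1, 2], 'col3': [1, 2, 3]})
print(missing_subset_columns(df, ['col1', 'col2']))  # ['col2']
print(missing_subset_columns(df, ['col1', 'col3']))  # []
```

An empty list means drop_duplicates(subset=...) will not raise this particular KeyError.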
I think the problem here is that there is no column 1: the first column was converted to the index, so some or all of the DataFrames have only one column, called 0.
To prevent this, pass the index_col=False parameter to read_csv:
def read_dataframes(filenames, basedir):
    return [pd.read_csv(basedir + file, sep='\t', header=None, quoting=csv.QUOTE_NONE, index_col=False) for file in filenames]
Another possibility is that, for some reason, the files contain only one column of data, so the second column, called 1, does not exist.
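To find out which case applies, it may help to inspect the columns of each DataFrame right after reading it. The following is a rough sketch, not a definitive diagnosis: the file names and contents in samples are invented stand-ins for the real files, and report_missing is a hypothetical helper. It assumes one file is really tab-separated and the other uses spaces, so the second one parses as a single column.

```python
import csv
import io

import pandas as pd

# Hypothetical stand-ins for the real files: one tab-separated, one not.
samples = {
    'tabbed.tsv': "a\t1\nb\t2\n",
    'spaced.tsv': "a 1\nb 2\n",
}

def read_sample(text):
    # Same read_csv options as read_dataframes in the question.
    return pd.read_csv(io.StringIO(text), sep='\t', header=None,
                       quoting=csv.QUOTE_NONE)

def report_missing(frames, required):
    """Map each name to the required column labels its frame lacks."""
    return {name: [c for c in required if c not in df.columns]
            for name, df in frames.items()}

frames = {name: read_sample(text) for name, text in samples.items()}
missing = report_missing(frames, required=[0, 1])
print(missing)  # {'tabbed.tsv': [], 'spaced.tsv': [1]}
```

Any file listed with a non-empty value was read with fewer columns than expected, which usually points at the separator or at the file contents.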