I wrote a simple script that is supposed to merge (union) some dataframes and remove the duplicates.
For example, for the input:
df_A:
a 1
b 2
df_B:
b 2
c 3
The expected output would be:
df_out:
a 1
b 2
c 3
I wrote the following code:
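    import csv
    import pandas as pd

    def read_dataframes(filenames, basedir):
        # Read each tab-separated file without a header row, so columns are labelled 0, 1, 2, ...
        return [pd.read_csv(basedir + file, sep='\t', header=None, quoting=csv.QUOTE_NONE) for file in filenames]

    def merge_dataframes(dfs, out):
        # Stack all frames, then drop rows that repeat in the first two columns
        merged = pd.concat(dfs).drop_duplicates(subset=[0, 1]).reset_index(drop=True)
        merged = merged.iloc[:, [0, 1, 2, 7, 8, 9]]
        merged.to_csv(out, header=None, index=None, sep='\t')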
I am calling these functions in the following manner:

    merge_dataframes(read_dataframes(filenames, basedir), output)
I am getting a KeyError exception:
    Traceback (most recent call last):
      File "analysis_and_visualization.py", line 70, in <module>
        merge_dataframes(read_dataframes(wild_emb, wild_basedir), 'wild_emb_merged')
      File "analysis_and_visualization.py", line 17, in merge_dataframes
        merged = pd.concat(dfs).drop_duplicates(subset=[0, 1]).reset_index(drop=True)
      File "/Data/user/eliran/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 5112, in drop_duplicates
        duplicated = self.duplicated(subset, keep=keep)
      File "/Data/user/eliran/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 5248, in duplicated
        raise KeyError(diff)
    KeyError: Int64Index([1], dtype='int64')
What am I doing wrong?
Going over the source code of the duplicated function in frame.py, it seems that not all of the columns you pass in subset exist in your dataframe. The KeyError lists the missing ones, which here is column 1.
Snippet from class DataFrame in frame.py:

    def duplicated(self, subset=None, keep='first'):
        ...
        # Verify all columns in subset exist in the queried dataframe
        # Otherwise, raise a KeyError, same as if you try to __getitem__ with a
        # key that doesn't exist.
        diff = Index(subset).difference(self.columns)
        if not diff.empty:
            raise KeyError(diff)
The same error can be reproduced by asking for a column that does not exist:

    df = pd.DataFrame({'col1': [0, 1, 2], 'col3': [1, 2, 3]})
    print(df)

       col1  col3
    0     0     1
    1     1     2
    2     2     3

    df.drop_duplicates(subset=['col1', 'col2'])

       5246         diff = Index(subset).difference(self.columns)
       5247         if not diff.empty:
    -> 5248             raise KeyError(diff)
       5249
       5250         vals = (col.values for name, col in self.items() if name in subset)

    KeyError: Index(['col2'], dtype='object')
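To find out which columns are missing before calling drop_duplicates, the same Index.difference check can be run by hand. This is only a minimal sketch: missing_columns is a hypothetical helper name, and the two small frames are made-up stand-ins for the files in the question.

```python
import pandas as pd

def missing_columns(df, subset):
    """Return the labels in `subset` that are not columns of `df`."""
    # Same set difference that DataFrame.duplicated performs internally.
    return pd.Index(subset).difference(df.columns).tolist()

# Made-up data: one frame with columns 0 and 1, one with only column 0.
df_ok = pd.DataFrame([["a", 1], ["b", 2]])
df_bad = pd.DataFrame([["a\t1"], ["b\t2"]])

for name, df in [("df_ok", df_ok), ("df_bad", df_bad)]:
    print(name, "columns:", list(df.columns), "missing:", missing_columns(df, [0, 1]))
```

Running this on each frame returned by read_dataframes should show whether column 1 is really there.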
I think the problem here is that there is no column 1: the first column is being converted to the index, so some or all of the DataFrames have only one column, called 0.
To prevent this, pass the index_col=False parameter to read_csv:
    def read_dataframes(filenames, basedir):
        return [pd.read_csv(basedir + file, sep='\t', header=None, quoting=csv.QUOTE_NONE, index_col=False) for file in filenames]
Another possibility is that, for some reason, the files contain only one column of data, so the second column (called 1) does not exist.
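One way to check the single-column case is to look at the shape of each DataFrame right after reading it. The sketch below is only an illustration under assumptions: it reads from in-memory strings instead of the real files, read_tsv is a hypothetical helper, and the space-separated sample is a guess at one way a file could end up with a single column when sep='\t' is used.

```python
import csv
import io

import pandas as pd

def read_tsv(text):
    # Same read_csv options as in the question, but reading from a string.
    return pd.read_csv(io.StringIO(text), sep='\t', header=None, quoting=csv.QUOTE_NONE)

tab_df = read_tsv("a\t1\nb\t2\n")   # really tab-separated: two columns (0 and 1)
space_df = read_tsv("b 2\nc 3\n")   # space-separated (assumed case): one column (0)

for name, df in [("tab_df", tab_df), ("space_df", space_df)]:
    print(name, df.shape)

# If every frame has only column 0, the concatenated frame has no column 1
# either, and drop_duplicates(subset=[0, 1]) fails as in the question.
try:
    pd.concat([space_df, space_df]).drop_duplicates(subset=[0, 1])
    raised = False
except KeyError:
    raised = True
print("KeyError raised:", raised)
```

Note that pd.concat takes the union of the columns by default, so the error only appears when none of the frames has a column 1. If at least one file is read correctly, the missing values in the other frames are filled with NaN instead.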