
Python / Pandas - drop_duplicates ValueError

I have a very large dataframe. When I run df = df.drop_duplicates(), I get the following error:

ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

Even df.loc[:10].drop_duplicates() already fails with the same error.

Does anyone know what can cause that?

EDIT

The dataframe looks like this:

                                                  Razao_social  Razao_social  \
business_id                                                                    

17                                             MULTIGRAIN S.A.     Sao Paulo   
17                                             MULTIGRAIN S.A.     Sao Paulo   
17                                             MULTIGRAIN S.A.     Sao Paulo   
17                                             MULTIGRAIN S.A.     Sao Paulo   
17                                             MULTIGRAIN S.A.     Sao Paulo   
17                                             MULTIGRAIN S.A.     Sao Paulo   
38           BRASILAGRO - COMPANHIA BRASILEIRA DE PROPRIEDA...     Sao Paulo   
38           BRASILAGRO - COMPANHIA BRASILEIRA DE PROPRIEDA...     Sao Paulo   
71                                    SECURITAS GARANTIAS S.A.     Sao Paulo   
71                                    SECURITAS GARANTIAS S.A.     Sao Paulo   
71                                    SECURITAS GARANTIAS S.A.     Sao Paulo   
71                                    SECURITAS GARANTIAS S.A.     Sao Paulo   

Without knowing more about the dataframe, I'm going to give some generic thoughts:

- There was a known bug in pandas 0.18 ( https://github.com/pandas-dev/pandas/issues/13393 ) that caused a buffer ValueError with MultiIndexes containing a datetime64 data type. Is one of your columns of this type?
- Do any of your columns have duplicate names? I know this isn't supposed to happen, but it does.
- Do you need to look for duplicates across all columns, or will a subset of columns suffice? Try using the subset= option in the method call (see the sketch below).
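
For example, a minimal sketch of the subset= idea, using hypothetical column names col1 and col2 as the de-duplication keys:

# Hypothetical keys: only rows identical in col1 and col2 count as duplicates
df = df.drop_duplicates(subset=['col1', 'col2'])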

The answer by Vico might be helpful, but with a very large dataframe, transposing both the initial dataframe and the de-duplicated dataframe may require more memory than is available.
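
Vico's answer is not shown here, but a transpose-based de-duplication usually looks something like this sketch (an assumption, not the exact code):

# Transposing turns duplicate columns into duplicate rows, which drop_duplicates can remove;
# this copies the whole frame (twice) and upcasts dtypes, so it is costly on large data
df = df.T.drop_duplicates().T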

Another way to run into this issue (and end up at this question through Google) is sparse columns. The following reproduces the error:

import numpy as np
import pandas as pd

n = 10000
df = pd.DataFrame({'a': np.random.choice(n * 2, n),
                   'b': np.random.choice(10, n),
                   'c': np.random.choice(4, n),
                   'd': np.random.choice(int(n / 2), n),
                   'e': np.random.choice(int(n / 100), n)})
# One-hot encode 'b' and 'c' as sparse columns
df_dummies = pd.get_dummies(df, columns=['b', 'c'], sparse=True)
# Raises ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
df_dummies.drop_duplicates()

The strange thing is that df_dummies.to_dense() won't solve the issue; recreating the dummies with sparse=False does.
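
A minimal sketch of that workaround for the example above:

# Rebuild the dummies as dense (non-sparse) columns; drop_duplicates then works
df_dummies = pd.get_dummies(df, columns=['b', 'c'], sparse=False)
df_dummies = df_dummies.drop_duplicates()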

To check for duplicate column names (the problem suggested in the other answers), you can use this snippet:

df.columns.duplicated()
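
If that returns True for any column, one generic way to keep only the first occurrence of each column name is:

# Keep only the first occurrence of each column name
df = df.loc[:, ~df.columns.duplicated()]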

I faced the same issue with a very large dataframe. I solved it by creating a temporary column that concatenates all the key columns into one string, dropping duplicates based on that single column, and then dropping the temporary column afterwards.

# Build a single key column from the columns that define uniqueness
df['concatenated_all'] = (df['col1'].astype(str) + '_' + df['col2'].astype(str) + '_' +
                          df['col3'].astype(str) + '_' + df['col4'].astype(str) + '_' +
                          df['coln'].astype(str))

# De-duplicate on that single column, keeping the last occurrence
df = df.drop_duplicates(subset='concatenated_all', keep="last")

# Drop the temporary key column
df = df.drop(['concatenated_all'], axis=1)
