
Pandas enumerate columns unexpected behavior

I'm using Python to extract tables from some PDFs with tabula. Every table is then converted to a pandas DataFrame, and I have to perform some analysis on them. I want to iterate over the columns to check whether any of them contains a particular string, but I noticed unexpected behavior in one particular df (at least, I can't understand what's going on).

These are the columns of the DataFrame, obtained with df.columns (df is the name of the DataFrame):

 Index(['cognome:xxxxnome:xxxxxprovenienza: esterno\r\rcodice fiscale: xxxxx\rdata valutazione neuropsicologica: 25/03/2021\rdata di nascita: 08/09/1955\retà (anni compiuti): 65\rsesso: m\rnumero anni di scolarità: 13', 'unnamed: 0'], dtype='object')

So, from what I see here, the name of the 0-th column should be

'cognome:xxxxnome:xxxxxprovenienza: esterno\r\rcodice fiscale: xxxxx\rdata valutazione neuropsicologica: 25/03/2021\rdata di nascita: 08/09/1955\retà (anni compiuti): 65\rsesso: m\rnumero anni di scolarità: 13'

What I don't understand is that, if I try to iterate through the columns of df , this is what happens:

for i, col in enumerate(list(df.columns)):
    print(f'{i}-th loop, column name = {col}')

Output:

 numero anni di scolarità: 13ogica: 25/03/2021xxxxprovenienza: esterno 1-th loop, column name = unnamed: 0

So here are my questions:

  1. Why is the index of the 0-th loop not printed?
  2. Why is the printed value of col for the 0-th loop different from the 0-th element of df.columns?

Some more details about df:

 <class 'pandas.core.frame.DataFrame'> 
 Int64Index: 0 entries 
 Data columns (total 2 columns):  
  #   Column                        Non-Null Count  Dtype
 ---  ------                        --------------  -----
 numero anni di scolarità: 13       0 non-null      float64
  1   unnamed: 0                    0 non-null      float64
 dtypes: float64(2) 
 memory usage: 0.0 bytes

I'm using Jupyter Notebook with pandas version 1.2.0.

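The problem is the carriage returns (\r) that your column name is full of. When the string is printed, each \r moves the cursor back to the beginning of the line, and the text that follows overwrites what was already there, character by character. So the index 0 does get printed, but it is then overwritten. That is also why the printed value of col looks different from the 0-th element of df.columns: the string is the same, but df.columns displays the \r as an escape sequence, while print sends the actual carriage returns to the output.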
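To make the overwriting visible, here is a minimal sketch in plain Python. The helper names render_carriage_returns and clean_name are hypothetical (they are not part of pandas or tabula), and the column name is a shortened, made-up stand-in for the one in the question. The first helper only approximates how a terminal or notebook renders \r; the second shows one possible way to normalize such names.

```python
import re


def render_carriage_returns(text):
    """Approximate what is displayed for one line containing carriage returns."""
    cells = []   # characters currently visible on the line
    cursor = 0   # position where the next character will be written
    for ch in text:
        if ch == "\r":
            cursor = 0            # jump back to the start of the line
        elif cursor < len(cells):
            cells[cursor] = ch    # overwrite the character already there
            cursor += 1
        else:
            cells.append(ch)      # extend the line
            cursor += 1
    return "".join(cells)


def clean_name(name):
    """Replace each run of carriage returns or newlines with a single space."""
    return re.sub(r"[\r\n]+", " ", name).strip()


# Shortened, made-up column name in the style of the one in the question.
col = "cognome: xxxx\rsesso: m"
line = f"0-th loop, column name = {col}"

# The text after the \r overwrites the start of the line, including the index.
print(render_carriage_returns(line))

# Using !r prints the repr, so the \r shows up as an escape instead of acting.
print(f"0-th loop, column name = {col!r}")

# One possible cleanup before further analysis.
print(clean_name(col))
```

With the DataFrame from the question, the same idea would be to print each name with {col!r} inside the loop, or to assign df.columns = [clean_name(c) for c in df.columns] before iterating.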
