How to deal with metadata lines in pandas.read_csv?

I have a text file with a metadata header followed by the actual data in CSV style. The floats use a comma as the decimal separator, like this:

title = someTitle
date = 20.0.2019
col= str1 str2 str3
2,49 42,01 -0,50
5,74 11,03 -0,43
....

I need all of this information in pandas (0.24.0), with the data parsed as floats.

df = pd.read_csv(path,sep='\t',decimal=',',names=[i for i in range(3)])

In this case, the decimal option makes no difference: I always get strings. Without the metadata it works perfectly, e.g. with:

pd.read_csv(...,skiprows=3)

It seems to me that pandas infers the column types from the first lines.

So how can I tell pandas to ignore the metadata?

read_csv can read from a file-like object, so you can open the file, read the first 3 lines as the header, extract the column names, and optionally pass them to read_csv. In addition, you can force the data type with the dtype option. The code could be:

import pandas as pd

with open(path) as fd:
    # consume the 3 metadata lines so read_csv only sees the data rows
    headers = [next(fd) for _ in range(3)]
    df = pd.read_csv(fd, sep=' ', decimal=',', dtype=float, names=...)

You can use the header part to set the column names if you want:

with open(path) as fd:
    headers = [next(fd) for _ in range(3)]
    # the third header line looks like "col= str1 str2 str3"
    cols = headers[2].split('=', 1)[1].strip().split(' ')
    df = pd.read_csv(fd, sep=' ', decimal=',', dtype=float, names=cols)

You would get:

   str1   str2  str3
0  2.49  42.01 -0.50
1  5.74  11.03 -0.43
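
If the metadata should be kept as well (the question asks for the whole information), the header lines can be parsed into a dict before the rest of the file is handed to read_csv. This is only a minimal sketch: it assumes every header line has the form key = value and that the column names are stored under col. The function name read_with_metadata and the in-memory SAMPLE text are made up for illustration and stand in for the real file.

```python
import io

import pandas as pd

# Stand-in for the real file, copied from the sample in the question.
SAMPLE = """title = someTitle
date = 20.0.2019
col= str1 str2 str3
2,49 42,01 -0,50
5,74 11,03 -0,43
"""


def read_with_metadata(handle, n_meta=3, sep=" "):
    """Hypothetical helper: return (metadata dict, DataFrame of floats)."""
    meta = {}
    for _ in range(n_meta):
        line = next(handle)                    # consume one header line
        key, _eq, value = line.partition("=")  # split at the first '='
        meta[key.strip()] = value.strip()
    cols = meta["col"].split()                 # assumes a 'col' header line
    # The handle now points at the first data row.
    df = pd.read_csv(handle, sep=sep, decimal=",", names=cols, dtype=float)
    return meta, df


meta, df = read_with_metadata(io.StringIO(SAMPLE))
print(meta)
print(df.dtypes)
```

With a real file, pass the open file object instead of the io.StringIO wrapper, and set sep to whatever delimiter the file actually uses (the question uses '\t').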
