简体   繁体   中英

Python: Replace values while reading CSV file

I have a CSV file with several columns that include integers and a string. Naturally, I get a dtype warning because of the mixed dtypes. I read the file with this general command.

df = pd.read_csv(path, sep=";", na_values=missing)

I could use low_memory=False or dtype=object to silence the warning but as far as I know this makes reading my file not more memory efficient.

I could also use na_values="my_string" but I have other missing values (which are supposed to be real missing values) and do not want to mix them.

I do not need the value of the string but only its value count so I thought of replacing it with an integer. Something like this.

df.replace(to_replace="my_string", value=999)

However, is it also possible to replace a value while reading a CSV file? Or does another solution exist? I do not want to simply silence the warning but find a solution which is more memory efficient.

(I know about this answer but it does not really help me with my problem.)

You can use converters :

In [156]: def conv(val, default_val=999):
     ...:     try:
     ...:         return int(val)
     ...:     except ValueError:
     ...:         return default_val
     ...:

In [157]: conv('a')
Out[157]: 999

In [158]: pd.read_csv(r'C:\Temp\test.csv', converters={'a':conv})
Out[158]:
     a   b           c
0    1  11  2000-01-01
1  999  12  2000-01-02
2    3  13  2000-01-02

Another approach, would be to convert columns in a vectorized way after parsing a CSV file:

In [166]: df = pd.read_csv(r'C:\Temp\test.csv', parse_dates=['c'])

In [167]: df
Out[167]:
     a    b          c
0    1  AAA 2000-01-01
1  XXX   12 2000-01-02
2    3   13 2000-01-02

In [168]: df.dtypes
Out[168]:
a            object
b            object
c    datetime64[ns]
dtype: object

In [169]: int_cols = ['a','b']

In [170]: df[int_cols] = df[int_cols].apply(pd.to_numeric, errors='coerce').fillna(999).astype(int)

In [171]: df
Out[171]:
     a    b          c
0    1  999 2000-01-01
1  999   12 2000-01-02
2    3   13 2000-01-02

In [172]: df.dtypes
Out[172]:
a             int32
b             int32
c    datetime64[ns]
dtype: object

Speed comparison for 300.000 rows DF:

In [175]: df = pd.concat([df] * 10**5, ignore_index=True)

In [176]: df.shape
Out[176]: (300000, 3)

In [177]: filename = r'C:\Temp\test.csv'

In [184]: df.to_csv(filename, index=False)

In [185]: %%timeit
     ...: df = pd.read_csv(filename, parse_dates=['c'], converters={'a':conv, 'b':conv})
     ...:
632 ms ± 25.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [186]: %%timeit
     ...: df = pd.read_csv(filename, parse_dates=['c'])
     ...: df[int_cols] = df[int_cols].apply(pd.to_numeric, errors='coerce').fillna(999).astype(int)
     ...:
706 ms ± 60.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

It's no possible to replace de values while you are reading a CSV file. You have to replace once you load the data and save it. Then you don't get warning anymore.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM