
How to skip bad lines when reading with dask?

I am trying to read a .txt file with dask (approximately 7 million rows). However, roughly 4,000 rows have values that do not match the column dtypes:

+-----------------------------+--------+----------+
| Column                      | Found  | Expected |
+-----------------------------+--------+----------+
| Pro_3FechaAprobacion        | object | int64    |
| Pro_3FechaCancelContractual | object | int64    |
| Pro_3FechaDesembolso        | object | int64    |
+-----------------------------+--------+----------+

The following columns also raised exceptions on conversion:

- Pro_3FechaAprobacion
  ValueError("invalid literal for int() with base 10: '200904XX'")
- Pro_3FechaCancelContractual
  ValueError("invalid literal for int() with base 10: '        '")
- Pro_3FechaDesembolso
  ValueError("invalid literal for int() with base 10: '200904XX'")

I know these are date columns formatted like %Y%m%d, but some records look like %Y%mXX. I want to skip these rows, the way I can in pandas with:

import pandas as pd

df = pd.read_csv("file.txt", error_bad_lines=False)

Is there any way to do this in dask?

The error_bad_lines=False keyword comes from pandas.read_csv, and I don't think it supports the behavior you want: it skips lines with the wrong number of fields, not lines that fail a dtype conversion. You might consider asking this same question with the pandas tag instead, to see whether people familiar with pandas' read_csv function can offer suggestions. The dask.dataframe.read_csv function just reuses that code.
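
One workaround that does work in dask is to stop the dtype inference from failing in the first place: read the offending columns as strings, then coerce them to dates and drop the rows that fail to parse. This is only a sketch, assuming the file is comma-separated and that the three columns listed above are the only problematic ones:

import dask.dataframe as dd

# Read the problematic date columns as plain strings so that no row
# fails the int64 dtype inference while the file is being parsed.
bad_cols = [
    "Pro_3FechaAprobacion",
    "Pro_3FechaCancelContractual",
    "Pro_3FechaDesembolso",
]
df = dd.read_csv("file.txt", dtype={c: "object" for c in bad_cols})

# Coerce to dates: malformed values such as '200904XX' or blank
# strings become NaT instead of raising ValueError.
for c in bad_cols:
    df[c] = dd.to_datetime(df[c], format="%Y%m%d", errors="coerce")

# Optionally drop the rows that failed to parse, which mimics the
# row-skipping behavior of error_bad_lines=False.
df = df.dropna(subset=bad_cols)

Because errors="coerce" turns unparseable values into NaT rather than raising, the whole file loads lazily and the roughly 4,000 bad rows can be filtered out in a single dropna pass.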
