I am trying to read a .txt file with dask (roughly 7 million rows). However, around 4,000 rows mismatch the dtype of their columns:
+-----------------------------+--------+----------+
| Column | Found | Expected |
+-----------------------------+--------+----------+
| Pro_3FechaAprobacion | object | int64 |
| Pro_3FechaCancelContractual | object | int64 |
| Pro_3FechaDesembolso | object | int64 |
+-----------------------------+--------+----------+
The following columns also raised exceptions on conversion:
- Pro_3FechaAprobacion
ValueError("invalid literal for int() with base 10: '200904XX'")
- Pro_3FechaCancelContractual
ValueError("invalid literal for int() with base 10: ' '")
- Pro_3FechaDesembolso
ValueError("invalid literal for int() with base 10: '200904XX'")
I know these are date columns formatted like %Y%m%d, but some records look like %Y%mXX. I want to skip these rows, the way I can in pandas with:
df = pd.read_csv("file.txt", error_bad_lines=False)
Is there any way to do this in dask?
The error_bad_lines=False keyword is taken from pandas.read_csv. I don't think it supports the behavior that you want. You might consider asking this same question with the pandas tag instead, to see if people familiar with Pandas' read_csv function can provide some suggestions. The dask.dataframe.read_csv function just uses that code.