I have a large dataset, a 200MB txt file in which the fields are separated by colons. Because the timestamps also contain colons, read_csv
fails when trying to parse the timestamp (and so it should).
Is there any way I can make pandas parse the timestamps correctly without cleaning or manipulating the data first?
Here is an example of the issue.
import pandas as pd
from datetime import datetime
from io import StringIO
to_dt = lambda x: datetime.strptime(x, "%m/%d/%Y %I:%M:%S %p")
ss = """first_name:date_registered
Philip:9/13/2020 12:03:05 AM"""
df = pd.read_csv(StringIO(ss), sep=":", parse_dates=["date_registered"], date_parser=to_dt)
print(df)
File "test.py", line 13, in <module>
df = pd.read_csv(StringIO(ss), sep=":", parse_dates=["date_registered"], date_parser=to_dt)
File "/home/mark/uk/.venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 610, in read_csv
return _read(filepath_or_buffer, kwds)
File "/home/mark/uk/.venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 468, in _read
return parser.read(nrows)
File "/home/mark/uk/.venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 1057, in read
index, columns, col_dict = self._engine.read(nrows)
File "/home/mark/uk/.venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 2103, in read
values = self._maybe_parse_dates(values, i, try_parse_dates=True)
File "/home/mark/uk/.venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 2167, in _maybe_parse_dates
if try_parse_dates and self._should_parse_dates(index):
File "/home/mark/uk/.venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 1439, in _should_parse_dates
j = self.index_col[i]
TypeError: 'NoneType' object is not subscriptable
The issue revolves around the fact that sep=":" and that the timestamps contain colons. If I change ss to be delimited by commas instead:
ss = """first_name,date_registered
Philip,9/13/2020 12:03:05 AM"""
and remove sep=":" from read_csv, the issue goes away, but as mentioned, this isn't feasible due to the size of the dataset.
Edit: An example of one row is:
12345:888888:Tom:Corn:builder:United Kingdom:London:four years:Travis:9/10/2017 12:00:00 AM::
The sep argument supports regular expressions. In this case
sep = r"(?<=\D):"
would work well. This pattern matches only the colons that are preceded by a non-digit character, so the colons inside the time are left alone. You need to find a pattern that separates your values reliably, or post more detail about your dataset.
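To see what the lookbehind does before pointing it at a 200MB file, here is a minimal sketch that applies the same pattern with the standard re module to the two-column sample from the question. The names SEP, sample_row and numeric_row are only illustrative. As far as I know, pandas treats a multi-character sep as a regular expression and falls back to the slower Python parsing engine for it.

```python
import re

# Lookbehind pattern from the answer: a colon preceded by a non-digit character.
SEP = re.compile(r"(?<=\D):")

# Data row from the question's minimal example.
sample_row = "Philip:9/13/2020 12:03:05 AM"
fields = SEP.split(sample_row)
print(fields)

# Caveat: a colon that follows a numeric field is NOT treated as a separator,
# so rows that start with numeric columns are split incorrectly.
numeric_row = "12345:888888:Tom:Corn"
numeric_fields = SEP.split(numeric_row)
print(numeric_fields)
```

The second print shows why this pattern only suits data whose non-timestamp fields never end in a digit.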
UPDATE:
With the new example given, I think excluding the colons inside the time will work. But again, it depends on the other rows of the dataset.
sep = r":(?!\d{,2}:\d{,2} [AP]M)(?!\d{,2} [AP]M)"
This pattern matches every colon except those that are followed by the rest of a time value. The time format assumed by the pattern is (0 to 2 digits):(0 to 2 digits):(0 to 2 digits) AM or PM.
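A quick way to check the lookahead pattern is to run it over the example row with re.split and then parse the timestamp field with the format string from the question. This is only a sketch against that single row; SEP, row and registered are illustrative names, and other rows in the real file may need a different pattern.

```python
import re
from datetime import datetime

# Pattern from the update: a colon that is NOT followed by the remainder
# of a time such as "00:00 AM" or "00 AM".
SEP = re.compile(r":(?!\d{,2}:\d{,2} [AP]M)(?!\d{,2} [AP]M)")

# Example row given in the question's edit.
row = ("12345:888888:Tom:Corn:builder:United Kingdom:London:"
       "four years:Travis:9/10/2017 12:00:00 AM::")

fields = SEP.split(row)
print(len(fields))
print(fields[9])

# The timestamp survives intact, so it can be parsed with the question's format.
registered = datetime.strptime(fields[9], "%m/%d/%Y %I:%M:%S %p")
print(registered)
```

The same pattern string can be passed as sep to read_csv; testing it on a handful of raw lines first is cheaper than loading the whole file.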
I suggest looping over the file line by line and replacing the first N and the last M occurrences of the colon in each row with a comma. N and M depend on the structure of your file. After this, you will be able to use pd.read_csv() as usual with a comma as the separator.
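A minimal sketch of that idea, assuming the layout of the example row (9 colons before the timestamp and 2 after it) and assuming that no field contains a comma. The helper name swap_outer_colons and its parameters are made up for illustration.

```python
import csv
from io import StringIO

def swap_outer_colons(line, n_first, n_last, new_sep=","):
    """Replace the first n_first and the last n_last colons in line with new_sep."""
    # str.replace with a count only touches the leftmost n_first colons.
    head_done = line.replace(":", new_sep, n_first)
    # rsplit with maxsplit only splits on the rightmost n_last colons.
    return new_sep.join(head_done.rsplit(":", n_last))

# Example row from the question: 9 colons before the timestamp, 2 after it.
row = ("12345:888888:Tom:Corn:builder:United Kingdom:London:"
       "four years:Travis:9/10/2017 12:00:00 AM::")

converted = swap_outer_colons(row, n_first=9, n_last=2)
print(converted)

# The converted line now parses as ordinary comma-separated data.
parsed = next(csv.reader(StringIO(converted)))
print(parsed[9])
```

For the full file, the same function can be applied to each line while streaming from the input file to a new output file, so the 200MB never has to sit in memory at once.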