
pandas read_csv fails to parse timestamps because the separator is a colon

I have a large dataset, a 200MB txt file where the data is separated with colons, which means that read_csv fails when trying to parse the timestamp (and so it should).

Is there any way I can ensure that pandas parses the timestamps correctly without me cleaning/manipulating the data?

Here is an example of the issue.

import pandas as pd
from datetime import datetime
from io import StringIO

to_dt = lambda x: datetime.strptime(x, "%m/%d/%Y %I:%M:%S %p")

ss = """first_name:date_registered
Philip:9/13/2020 12:03:05 AM"""

df = pd.read_csv(StringIO(ss), sep=":", parse_dates=["date_registered"], date_parser=to_dt)

print(df)

Running this produces the following traceback:

  File "test.py", line 13, in <module>
    df = pd.read_csv(StringIO(ss), sep=":", parse_dates=["date_registered"], date_parser=to_dt)
  File "/home/mark/uk/.venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 610, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/mark/uk/.venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 468, in _read
    return parser.read(nrows)
  File "/home/mark/uk/.venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 1057, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "/home/mark/uk/.venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 2103, in read
    values = self._maybe_parse_dates(values, i, try_parse_dates=True)
  File "/home/mark/uk/.venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 2167, in _maybe_parse_dates
    if try_parse_dates and self._should_parse_dates(index):
  File "/home/mark/uk/.venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 1439, in _should_parse_dates
    j = self.index_col[i]
TypeError: 'NoneType' object is not subscriptable

The issue is that sep=":" collides with the colons inside the timestamps. If I change ss to be delimited by commas instead:

ss = """first_name,date_registered
Philip,9/13/2020 12:03:05 AM"""

and remove sep=":" from read_csv the issue goes away, but as mentioned, this isn't feasible due to the size of the dataset.

Edit: An example of one row is

12345:888888:Tom:Corn:builder:United Kingdom:London:four years:Travis:9/10/2017 12:00:00 AM::

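The sep argument supports regular expressions (a regex separator makes pandas use the Python parsing engine). In this case

sep = r"(?<=\D):"

would work well. This pattern matches only colons that are preceded by a non-digit character, so the colons inside the time are left alone. If other fields in your data end in a digit, you will need a different pattern; post more detail about your dataset if that is the case.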
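To see which colons this pattern treats as separators, here is a minimal sketch using re.split from the standard library. With a regex separator, the Python parsing engine in pandas splits each line in much the same way. The two sample lines are the ones from the question; the variable names are only illustrative.

```python
import re

# Colons preceded by a non-digit character count as separators.
sep = re.compile(r"(?<=\D):")

simple = "Philip:9/13/2020 12:03:05 AM"
full_row = ("12345:888888:Tom:Corn:builder:United Kingdom:London:"
            "four years:Travis:9/10/2017 12:00:00 AM::")

simple_fields = sep.split(simple)
full_fields = sep.split(full_row)

print(simple_fields)  # name and timestamp come out as two fields
print(full_fields)    # the leading numeric fields stay glued together
```

The simple line splits cleanly, but in the longer row the colons after the numeric ids are preceded by digits, so they are not recognised as separators. That is why the pattern depends on what the rest of the data looks like.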

UPDATE:

With the new example row, I think excluding the colons that are part of the time will work, but again it depends on the other rows of the dataset.

sep = r":(?!\d{,2}:\d{,2} [AP]M)(?!\d{,2} [AP]M)"

This pattern matches every colon except those followed by the rest of a time. The time format assumed by the pattern is (0 to 2 digits):(0 to 2 digits):(0 to 2 digits), then a space and AM or PM.
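As a rough sketch of how this could be wired into read_csv, the snippet below parses the single sample row from the question. It assumes the file has no header line and that the timestamp always lands in the tenth column (position 9); both are guesses based on that one row, so adjust them to the real file. The timestamp is converted after loading with pd.to_datetime rather than during the read.

```python
from io import StringIO

import pandas as pd

row = ("12345:888888:Tom:Corn:builder:United Kingdom:London:"
       "four years:Travis:9/10/2017 12:00:00 AM::\n")

# A colon followed by "MM:SS AM/PM" or "SS AM/PM" belongs to a time, not a separator.
sep = r":(?!\d{,2}:\d{,2} [AP]M)(?!\d{,2} [AP]M)"

# Regex separators require the Python engine; header=None is an assumption
# because the sample row comes without column names.
df = pd.read_csv(StringIO(row), sep=sep, engine="python", header=None)

# Column 9 is assumed to hold the timestamp (true for the sample row only).
df[9] = pd.to_datetime(df[9], format="%m/%d/%Y %I:%M:%S %p")

print(df.iloc[0].tolist())
```

The Python engine is slower than the default C engine, which may matter for a 200MB file, so it is worth timing this on a slice of the data first.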

I suggest looping over the file line by line and replacing the first N and the last M occurrences of the colon in each row with a comma. N and M depend on the structure of your file. After that, you can use pd.read_csv() as usual with a comma as the separator.
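A minimal sketch of that idea, using only the standard library. The helper name split_row is made up for illustration, and N=9 and M=2 are read off the single sample row in the question (nine fields before the timestamp, two empty fields after it), so they are assumptions about the rest of the file. Instead of replacing characters in place, it splits on the outer colons and lets csv.writer emit the commas, which also quotes any field that happens to contain a comma.

```python
import csv
import io


def split_row(line, n_leading, n_trailing):
    """Split on the first n_leading and the last n_trailing colons only."""
    # Everything after the first n_leading colons ends up in `rest`.
    *leading, rest = line.rstrip("\n").split(":", n_leading)
    # Split `rest` from the right so colons inside the timestamp survive.
    return leading + rest.rsplit(":", n_trailing)


row = ("12345:888888:Tom:Corn:builder:United Kingdom:London:"
       "four years:Travis:9/10/2017 12:00:00 AM::")

fields = split_row(row, 9, 2)

# Write the fields as one comma-separated line (hypothetical in-memory target).
out = io.StringIO()
csv.writer(out).writerow(fields)
print(out.getvalue())
```

For the real file, the same function could be applied to each line while streaming from the input file to a new output file, so the 200MB never has to sit in memory at once.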

