pandas read_csv fails to parse timestamps because the separator is a colon

I have a large dataset, a 200MB txt file in which the fields are separated by colons. This means that read_csv fails when trying to parse the timestamps (and so it should).

Is there any way to make pandas parse the timestamps correctly without me cleaning or manipulating the data first?

Here is an example of the issue.

import pandas as pd
from datetime import datetime
from io import StringIO

to_dt = lambda x: datetime.strptime(x, "%m/%d/%Y %I:%M:%S %p")

ss = """first_name:date_registered
Philip:9/13/2020 12:03:05 AM"""

df = pd.read_csv(StringIO(ss), sep=":", parse_dates=["date_registered"], date_parser=to_dt)

print(df)

Traceback (most recent call last):
  File "test.py", line 13, in <module>
    df = pd.read_csv(StringIO(ss), sep=":", parse_dates=["date_registered"], date_parser=to_dt)
  File "/home/mark/uk/.venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 610, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/mark/uk/.venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 468, in _read
    return parser.read(nrows)
  File "/home/mark/uk/.venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 1057, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "/home/mark/uk/.venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 2103, in read
    values = self._maybe_parse_dates(values, i, try_parse_dates=True)
  File "/home/mark/uk/.venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 2167, in _maybe_parse_dates
    if try_parse_dates and self._should_parse_dates(index):
  File "/home/mark/uk/.venv/lib/python3.8/site-packages/pandas/io/parsers.py", line 1439, in _should_parse_dates
    j = self.index_col[i]
TypeError: 'NoneType' object is not subscriptable

The issue comes down to sep=":" clashing with the colons inside the timestamps. If I change ss to be comma-delimited instead

ss = """first_name,date_registered
Philip,9/13/2020 12:03:05 AM"""

and remove sep=":" from read_csv, the issue goes away. As mentioned, though, this isn't feasible due to the size of the dataset.

Edit: An example of one row is

12345:888888:Tom:Corn:builder:United Kingdom:London:four years:Travis:9/10/2017 12:00:00 AM::

The sep argument of read_csv accepts a regular expression. In this case

sep = r"(?<=\D):"

would work well. This pattern matches only colons that are preceded by a non-digit character, so the colons inside the timestamp are left alone. You need to find a pattern that separates the values reliably across your data, or post more detail about your dataset.

UPDATE:

With the new example row, I think excluding the colons that belong to the time will work. Again, though, it depends on the other rows of the dataset.
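As a minimal sketch of the lookbehind separator, the snippet below applies it to the two-line sample from the question. It passes engine="python" explicitly, since pandas only supports regex separators in its Python parsing engine, and it converts the column afterwards with pd.to_datetime instead of using date_parser. That last part is a choice made for this sketch, not something the answer prescribes.

```python
from io import StringIO

import pandas as pd

ss = """first_name:date_registered
Philip:9/13/2020 12:03:05 AM"""

# Split only on colons preceded by a non-digit character.
# Regex separators require the Python parsing engine.
df = pd.read_csv(StringIO(ss), sep=r"(?<=\D):", engine="python")

# Convert the timestamp column after loading, using the format from the question.
df["date_registered"] = pd.to_datetime(
    df["date_registered"], format="%m/%d/%Y %I:%M:%S %p"
)

print(df)
```

Note that this pattern will not split after any field that ends in a digit (a numeric ID, for example), which is why it depends on what the rest of the data looks like.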

sep = r":(?!\d{,2}:\d{,2} [AP]M)(?!\d{,2} [AP]M)"

This pattern matches every colon except those that are followed by the rest of a time value. The time format assumed by the pattern is (0 to 2 digits):(0 to 2 digits):(0 to 2 digits), followed by a space and AM or PM.
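To see what the negative-lookahead pattern does, here is a small sketch that runs it against the example row from the question, first with re.split and then through read_csv. The header=None setting and the integer column label 9 are assumptions based on that single row. A real file may have a header line or a different layout.

```python
import re
from io import StringIO

import pandas as pd

row = (
    "12345:888888:Tom:Corn:builder:United Kingdom:London:"
    "four years:Travis:9/10/2017 12:00:00 AM::"
)
sep = r":(?!\d{,2}:\d{,2} [AP]M)(?!\d{,2} [AP]M)"

# Inspect the split directly: the timestamp stays in one piece.
fields = re.split(sep, row)
print(fields)

# The same pattern as a read_csv separator (Python engine required for regex).
df = pd.read_csv(StringIO(row), sep=sep, header=None, engine="python")

# Column 9 holds the timestamp in this example row.
df[9] = pd.to_datetime(df[9], format="%m/%d/%Y %I:%M:%S %p")
print(df)
```

Checking a handful of rows with re.split like this is a cheap way to validate a candidate pattern before running it over the full 200MB file.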

I suggest looping over the file line by line and, in each row, replacing the first N and the last M occurrences of the colon with a comma. N and M depend on the structure of your file. After this, you can use pd.read_csv() as usual with a comma as the separator.
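The line-by-line approach could be sketched as follows. The helper name colons_to_commas is hypothetical, and N=9 and M=2 are read off the single example row in the question. The sketch also assumes that no field contains a comma. For the real file you would stream the lines from disk into a new file instead of using StringIO.

```python
from io import StringIO

import pandas as pd


def colons_to_commas(line: str, n_first: int, n_last: int) -> str:
    """Replace the first n_first and the last n_last colons with commas."""
    head = line.split(":", n_first)        # split on the first n_first colons
    tail = head.pop().rsplit(":", n_last)  # split the remainder on the last n_last
    return ",".join(head + tail)


row = (
    "12345:888888:Tom:Corn:builder:United Kingdom:London:"
    "four years:Travis:9/10/2017 12:00:00 AM::"
)

raw = StringIO(row + "\n")  # stands in for the real input file
cleaned = StringIO(
    "".join(colons_to_commas(line.rstrip("\n"), 9, 2) + "\n" for line in raw)
)

# The colons inside the timestamp are untouched, so a plain comma separator works.
df = pd.read_csv(cleaned, header=None)
df[9] = pd.to_datetime(df[9], format="%m/%d/%Y %I:%M:%S %p")
print(df)
```

Because the timestamp colons sit between the first N and the last M separators, they are never replaced, and no regex separator is needed.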
