
How can I iterate more efficiently through a large CSV file?

I have a number of large CSV files (each around two million rows), which contain timestamps looking like this:

16.01.2019 12:52:22
16.01.2019 12:52:23
16.01.2019 12:52:24

Given that there's an entry for each second (over the course of about a year), it should be understandable why there are so many rows. I want to be more flexible, which is why I want to split each timestamp into four columns: date, date+hour, date+hour+minute, and date+hour+minute+second, so that I'm able to group timestamps at will. This is how I'm doing it:

dates = []
hours = []
minutes = []
seconds = []
i = 0


#initial values
dates.append(str(get_date(i).date()))
hours.append(str(get_date(i).hour))
minutes.append(str(get_date(i).minute))
seconds.append(str(get_date(i).second))

for i in range(len(df)):
  if i < len(df) - 1 :
    if str(get_date(i).date()) < str(get_date(i+1).date()): #dates: YYYY-MM-DD
      dates.append(str(get_date(i+1).date()))
    else:
      dates.append(str(get_date(i).date()))

    if str(get_date(i).hour) < str(get_date(i+1).hour): #dates+hours: YYYY-MM-DD HH
      hours.append(str(get_date(i+1).date()) + " " + str(get_date(i+1).hour))
    else:
      hours.append(str(get_date(i).date()) + " " + str(get_date(i).hour))

    if str(get_date(i).minute) < str(get_date(i+1).minute): #dates+hours+minutes: YYYY-MM-DD HH:mm
      minutes.append(str(get_date(i+1).date()) + " " + str(get_date(i+1).hour) + ":" + str(get_date(i+1).minute))
    else: 
      minutes.append(str(get_date(i).date()) + " " + str(get_date(i).hour) + ":" + str(get_date(i).minute))

    if str(get_date(i).second) < str(get_date(i+1).second): #dates+hours+minutes+seconds: YYYY-MM-DD HH:mm:ss
      seconds.append(str(get_date(i+1).date()) + " " + str(get_date(i+1).hour) + ":" + str(get_date(i+1).minute) + ":" + str(get_date(i+1).second))
    else: 
      seconds.append(str(get_date(i).date()) + " " + str(get_date(i).hour) + ":" + str(get_date(i).minute) + ":" + str(get_date(i).second))


df["dates"] = dates
df["hours"] = hours
df["minutes"] = minutes
df["seconds"] = seconds

where get_date() is simply a function returning the timestamp with the given index:

import datetime as dt

def get_date(i):
  return dt.datetime.strptime(df["timestamp"][i], '%d.%m.%Y %H:%M:%S')

I basically iterate through all entries, put each date/hour/minute/second into a list, and then insert each list into my dataframe as a new column.

I guess this would put me at O(n²), which is obviously not ideal.

Now, doing this on one file (~60MB, 2 million rows) takes half an hour. I personally can't think of another way to do what I want to do, so I just wanted to see if there's anything I can do to reduce the complexity.

edit: Tweaking @Chris' answer for my needs:

times = bogie_df["timestamp"]

#got an error when passing the map directly to pd.DataFrame, which is why I first converted it to a list
items = ['year', 'month', 'day', 'hour', 'minute', 'second']
df = pd.DataFrame(list(map(operator.attrgetter(*items), pd.to_datetime(times))), columns=items)

#for my desired YYYY-MM-DD format (though attrgetter only returns "1" for January instead of "01")
df["date"] = df['year'].map(str) + "-" + df["month"].map(str) + "-" + df["day"].map(str)

Use operator.attrgetter with pd.to_datetime:

import pandas as pd
import operator

s = pd.Series(["16.01.2019 12:52:22",
               "16.01.2019 12:52:23",
               "16.01.2019 12:52:24"])

items = ['day', 'hour', 'minute', 'second']
df = pd.DataFrame(list(map(operator.attrgetter(*items), pd.to_datetime(s))), columns=items)

Output:

   day  hour  minute  second
0   16    12      52      22
1   16    12      52      23
2   16    12      52      24
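
For clarity, operator.attrgetter with several attribute names returns a tuple of those attributes for each element, which is why the map above yields one row per timestamp:

import operator
import pandas as pd

get = operator.attrgetter('day', 'hour')
print(get(pd.Timestamp('2019-01-16 12:52:22')))  # prints (16, 12)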

Benchmark:

large_s = pd.Series(pd.date_range('16.01.2019 12:52:22', periods=2000000, freq='1s').astype(str).tolist())
# make 2M rows of timestamps as strings

%%timeit

items = ['day', 'hour', 'minute', 'second']
df = pd.DataFrame(list(map(operator.attrgetter(*items), pd.to_datetime(large_s))), columns=items)
# 6.77 s ± 54.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

takes about 7 seconds.
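
For comparison (an alternative sketch, not benchmarked in the original answer), the same columns can also be built with the vectorized .dt accessor instead of attrgetter:

items = ['day', 'hour', 'minute', 'second']
dt_s = pd.to_datetime(large_s)
# one vectorized attribute lookup per column instead of one Python call per row
df2 = pd.DataFrame({item: getattr(dt_s.dt, item) for item in items})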

Updated:

Instead of manually creating YYYY-mm-dd-formatted strings, let pandas do it for you.

df = pd.DataFrame(pd.to_datetime(s), columns = ['date'])
items = ['year', 'month', 'day', 'hour', 'minute', 'second']
df[items] = pd.DataFrame(list(map(operator.attrgetter(*items), df['date'])), columns=items)

Then:

df['dates'] = df['date'].astype(str).str[:10]

Output (with the month zero-padded):

0    2019-01-16
1    2019-01-16
2    2019-01-16
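
Equivalently (an alternative not in the original answer), since df['date'] holds real Timestamps, dt.strftime yields the same zero-padded strings:

df['dates'] = df['date'].dt.strftime('%Y-%m-%d')  # same zero-padded YYYY-MM-DD output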

Since comments are limited in length, here are at least the following points:

  1. Remove the if i < len(df) - 1 : check; you don't need it. Replace your range with range(len(df)-1) .
  2. Save the results of your get_date function:

Before the loop: next_time = get_date(0)

Inside the loop:

current_time = next_time
next_time = get_date(i+1)

This should save you some function calls, though pandas probably has a better way to do this kind of thing.
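
Putting both points together, the loop skeleton would look roughly like this (a sketch built on the question's get_date and df, parsing each timestamp only once):

next_time = get_date(0)          # parse the first timestamp before the loop
for i in range(len(df) - 1):     # point 1: no bounds check needed inside the loop
    current_time = next_time     # point 2: reuse the value parsed last iteration
    next_time = get_date(i + 1)  # every timestamp is parsed exactly once overall
    # ... compare current_time with next_time and append to the lists as before ...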

You do not need any of that, but should instead add a single column of Timestamp type:

df['ts'] = pd.to_datetime(df.timestamp, format='%d.%m.%Y %H:%M:%S')

Then you can directly use all the time goodies on that column:

  • df.ts.dt.date : gives the date as a datetime.date
  • df.ts.dt.strftime(format) : gives the timestamp as a string formatted with format. For example, df.ts.dt.strftime("%Y-%m-%d %H:%M") is your "minutes" column
  • df.ts.dt.floor(freq='h') : gives the timestamp truncated at hour level, for example for grouping (see the sketch after this list)
  • ... (just read pandas Datetime Data for reference)
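
For instance, grouping at minute granularity needs no extra string columns at all (a minimal sketch, assuming a df shaped like the question's):

import pandas as pd

df = pd.DataFrame({"timestamp": ["16.01.2019 12:52:22",
                                 "16.01.2019 12:52:23",
                                 "16.01.2019 12:53:01"]})
df['ts'] = pd.to_datetime(df.timestamp, format='%d.%m.%Y %H:%M:%S')

#count rows per minute by flooring the Timestamp column
print(df.groupby(df.ts.dt.floor('min')).size())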
