Trouble optimizing python interpolation script

Question

I am interpolating arrival times for some public transportation data I have. I have a working script, but it seems to be running in quadratic time. Here is the script:

import pandas as pd

#read the txt file
st = pd.read_csv('interpolated_test.csv')

# sort first by trip_id, then by stop_sequence
sorted_st = st.sort(['trip_id','stop_sequence'], ascending=[False,True])

# reset the index values in prep. for iteration
reindexed = sorted_st.reset_index(drop=True)

# for each row in 'arrival_time' that has a value of hh:mm:ss
for i in reindexed['arrival_time']:
# for i in range(len(reindexed['arrival_time'])):
    if pd.isnull(i) == False:
        # splice hh:mm:ss
        hour = int(i[:2])
        minute = int(i[3:5])
        # assign hh:mm:ss to numeric value
        minute_value = (hour * 60) + minute

        # replace current string with int value
        # takes ~655s to execute on Macbook Pro w/ entire stop_times.txt
        # runs in quadratic time
        reindexed = reindexed.replace(i,minute_value)

# interpolate and write out
new = reindexed.apply(pd.Series.interpolate)
print(new)

Here is a link to the csv: https://gist.github.com/adampitchie/0192933ed0eba122ba7e

I shortened the csv so you can run the file without waiting for it to finish.

This should be low-hanging fruit for anybody familiar with pandas, but I'm stuck and any help would be appreciated.

[UPDATE] I tried running the same code with the full CSV file, and I get this error:

Traceback (most recent call last):
  File "/Users/tester/Desktop/ETL/interpolate.py", line 49, in <module>
    reindexed[col].dt.hour * 60
  File "pandas/src/properties.pyx", line 34, in pandas.lib.cache_readonly.__get__ (pandas/lib.c:40664)
  File "/Library/Python/2.7/site-packages/pandas/core/series.py", line 2513, in dt
    raise TypeError("Can only use .dt accessor with datetimelike values")
TypeError: Can only use .dt accessor with datetimelike values

It looks like pd.to_datetime(reindexed[col]) is not working. Here is the code, for the sake of completeness:

import pandas as pd

st = pd.read_csv('csv/stop_times.csv')

sorted_st = st.sort(['trip_id','stop_sequence'], ascending=[False,True])

reindexed = sorted_st.reset_index(drop=True)

for col in ('arrival_time', 'departure_time'):
    reindexed[col] = pd.to_datetime(reindexed[col])
    reindexed[col] = (
        reindexed[col].dt.hour * 60
        + reindexed[col].dt.minute)
    reindexed[col] = reindexed[col].interpolate()

print(reindexed.iloc[:, :3])

Answer 1

Whenever you can, try to phrase computations as operations on whole columns rather than row-by-row or item-by-item. Instead of handling each value in reindexed['arrival_time'] one at a time, you can convert the whole column to datetime64 values using pd.to_datetime. A Series of datetime64 values has a dt attribute that gives you access to the hours and minutes as integers. So you can express the calculation for the whole column like this:

for col in ('arrival_time', 'departure_time'):
    reindexed[col] = pd.to_datetime(reindexed[col])
    reindexed[col] = (
        reindexed[col].dt.hour * 60
        + reindexed[col].dt.minute)
    reindexed[col] = reindexed[col].interpolate()

print(reindexed.iloc[:5, :3])

yields

    trip_id  arrival_time  departure_time
0   1423492    647.000000      647.000000
1   1423492    649.666667      649.666667
2   1423492    652.333333      652.333333
3   1423492    655.000000      655.000000
4   1423492    655.750000      655.750000
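
If you want to try the idea without the CSV, here is a minimal self-contained sketch on a made-up four-row frame. The sample values are hypothetical, and I pass format='%H:%M:%S' explicitly, which the snippet above does not do:

```python
import pandas as pd

# Hypothetical sample: two known arrival times with two missing stops between them.
df = pd.DataFrame({
    'trip_id': [1, 1, 1, 1],
    'arrival_time': ['10:47:00', None, None, '10:55:00'],
})

# Convert the whole column at once; missing values become NaT.
times = pd.to_datetime(df['arrival_time'], format='%H:%M:%S')

# Minutes since midnight, computed column-wise (NaT rows become NaN).
minutes = times.dt.hour * 60 + times.dt.minute

# Linear interpolation fills the two gaps between 647 and 655.
df['arrival_time'] = minutes.interpolate()
print(df)
```

The two missing rows are filled at equal steps between the known values, which is the same pattern as rows 1 and 2 in the output above.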

Debugging TypeError: Can only use .dt accessor with datetimelike values:

Indeed, as you pointed out, pd.to_datetime is not converting the times to datetime64 values. Instead, it is just returning the same data as strings. In this version of pandas, pd.to_datetime returns its input unchanged when it encounters an error trying to convert the input to datetimes (later versions raise by default). You can gather a bit more information about what is going wrong by adding the errors='raise' parameter:

pd.to_datetime(reindexed['arrival_time'], errors='raise')

raises

ValueError: hour must be in 0..23

So aha -- the time format probably has times whose hours exceed 23.
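
If you would rather list the offending values than stop at the first exception, one option is errors='coerce', which turns anything unparseable into NaT. This is only a sketch on a hypothetical sample Series, and it assumes the times are formatted as HH:MM:SS:

```python
import pandas as pd

# Hypothetical sample: one valid time, one with hour > 23, one missing, one valid.
times = pd.Series(['23:59:00', '26:09:00', None, '08:15:00'])

# Values that do not match the format become NaT instead of raising.
parsed = pd.to_datetime(times, format='%H:%M:%S', errors='coerce')

# Keep only rows that had a value but failed to parse.
bad = times[parsed.isna() & times.notna()]
print(bad)
```

Rows that were already missing are excluded by the times.notna() check, so only genuinely malformed strings show up.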

Using

col = 'arrival_time'
x = reindexed[col]
# cast to float rather than int so that missing times (NaN) do not raise
mask = x.str.extract(r'(\d+):(\d+):(\d+)')[0].astype('float') > 23

we can see examples of rows where the hour is greater than 23:

In [48]: x[mask].head()
Out[48]: 
42605    26:09:00
42610    26:12:00
42611    26:20:00
42612    26:30:00
42613    26:35:00
Name: arrival_time, dtype: object

The x.str.extract call splits the arrival time strings using the regex pattern r'(\d+):(\d+):(\d+)'. It returns a DataFrame with three columns.
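
To make the shape of that result concrete, here is a small sketch on a hypothetical two-element Series. The column labels 0, 1, 2 assume a reasonably recent pandas where unnamed groups are numbered:

```python
import pandas as pd

# Hypothetical sample of GTFS-style time strings.
x = pd.Series(['10:47:00', '26:09:00'])

# One column per capture group; the values are still strings at this point.
parts = x.str.extract(r'(\d+):(\d+):(\d+)')
print(parts)

# Column 0 holds the hours, so it can be cast and compared.
hours = parts[0].astype('float')
print(hours > 23)
```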

This piece of debugging code suggests a workaround. Instead of pd.to_datetime, we could use x.str.extract to find the hours and minutes:

import pandas as pd

st = pd.read_csv('csv/stop_times.csv')

# DataFrame.sort was removed in pandas 0.20; sort_values is its replacement
sorted_st = st.sort_values(['trip_id','stop_sequence'], ascending=[False,True])

reindexed = sorted_st.reset_index(drop=True)

for col in ('arrival_time', 'departure_time'):
    df = reindexed[col].str.extract(
        r'(?P<hour>\d+):(?P<minute>\d+):(?P<second>\d+)').astype('float')
    reindexed[col] = df['hour'] * 60 + df['minute']
    reindexed[col] = reindexed[col].interpolate()

print(reindexed.iloc[:5, :3])

yields

   trip_id  arrival_time  departure_time
0  1423492    647.000000      647.000000
1  1423492    649.666667      649.666667
2  1423492    652.333333      652.333333
3  1423492    655.000000      655.000000
4  1423492    655.750000      655.750000
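
Another route you might try, which is not what the code above does, is pd.to_timedelta. It treats the strings as durations, so hours past 23 are not a problem. A minimal sketch on a hypothetical sample Series follows. Note that it keeps the seconds as a fraction of a minute, whereas the regex version drops them:

```python
import pandas as pd

# Hypothetical sample: a normal time, a missing value, and a time past midnight.
times = pd.Series(['10:47:00', None, '26:09:00'])

# Parse as durations since midnight; missing values become NaT.
minutes = pd.to_timedelta(times).dt.total_seconds() / 60

# The missing entry is filled linearly between its neighbours.
filled = minutes.interpolate()
print(filled)
```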

solution1
0 ACCEPTED 2014-12-13 01:03:34
