I am interpolating arrival times for some public transportation data I have. I have a working script, but it seems to be running in quadratic time. Here is the script:
import pandas as pd

# read the txt file
st = pd.read_csv('interpolated_test.csv')
# sort first by trip_id, then by stop_sequence
sorted_st = st.sort(['trip_id','stop_sequence'], ascending=[False,True])
# reset the index values in prep. for iteration
reindexed = sorted_st.reset_index(drop=True)
# for each row in 'arrival_time' that has a value of hh:mm:ss
for i in reindexed['arrival_time']:
    # for i in range(len(reindexed['arrival_time'])):
    if pd.isnull(i) == False:
        # splice hh:mm:ss
        hour = int(i[:2])
        minute = int(i[3:5])
        # assign hh:mm:ss to numeric value
        minute_value = (hour * 60) + minute
        # replace current string with int value
        # takes ~655s to execute on Macbook Pro w/ entire stop_times.txt
        # runs in quadratic time
        reindexed = reindexed.replace(i,minute_value)
# interpolate and write out
new = reindexed.apply(pd.Series.interpolate)
print(new)
Here is a link to the csv: https://gist.github.com/adampitchie/0192933ed0eba122ba7e
I shortened the csv so you can run the file without waiting for it to finish.
This should be low-hanging fruit for anybody familiar with pandas, but I'm stuck and any help would be appreciated.
[UPDATE] So I tried running the same code with the full CSV file, and I get this error:
Traceback (most recent call last):
  File "/Users/tester/Desktop/ETL/interpolate.py", line 49, in <module>
    reindexed[col].dt.hour * 60
  File "pandas/src/properties.pyx", line 34, in pandas.lib.cache_readonly.__get__ (pandas/lib.c:40664)
  File "/Library/Python/2.7/site-packages/pandas/core/series.py", line 2513, in dt
    raise TypeError("Can only use .dt accessor with datetimelike values")
TypeError: Can only use .dt accessor with datetimelike values
It looks like pd.to_datetime(reindexed[col]) is not working. Here is the code, for the sake of completeness:
import pandas as pd

st = pd.read_csv('csv/stop_times.csv')
sorted_st = st.sort(['trip_id','stop_sequence'], ascending=[False,True])
reindexed = sorted_st.reset_index(drop=True)
for col in ('arrival_time', 'departure_time'):
    reindexed[col] = pd.to_datetime(reindexed[col])
    reindexed[col] = (
        reindexed[col].dt.hour * 60
        + reindexed[col].dt.minute)
    reindexed[col] = reindexed[col].interpolate()
print(reindexed.iloc[:, :3])
Whenever you can, try to phrase computations as operations on whole columns rather than row by row or item by item. Instead of handling each value in reindexed['arrival_time'] one at a time, you can convert the whole column into datetime64s using pd.to_datetime. A Series of datetime64s has a dt attribute which allows you to access the hours and minutes as integers. So you can express the calculation for the whole column like this:
for col in ('arrival_time', 'departure_time'):
    reindexed[col] = pd.to_datetime(reindexed[col])
    reindexed[col] = (
        reindexed[col].dt.hour * 60
        + reindexed[col].dt.minute)
    reindexed[col] = reindexed[col].interpolate()
print(reindexed.iloc[:5, :3])
yields
   trip_id  arrival_time  departure_time
0  1423492    647.000000      647.000000
1  1423492    649.666667      649.666667
2  1423492    652.333333      652.333333
3  1423492    655.000000      655.000000
4  1423492    655.750000      655.750000
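If you want to try the column-wise approach without the CSV, here is a minimal sketch on a made-up Series. The sample times are an assumption, not taken from the gist, and the explicit format argument is an addition that the code above does not use.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two known times with two missing stops in between.
times = pd.Series(['10:47:00', np.nan, np.nan, '10:55:00'])

# Parse the whole column at once; missing values become NaT.
parsed = pd.to_datetime(times, format='%H:%M:%S')

# Minutes since midnight, computed for every row in one vectorized step.
minutes = parsed.dt.hour * 60 + parsed.dt.minute

# Linear interpolation fills the gaps between the known values.
filled = minutes.interpolate()
print(filled.tolist())
```

There is no Python-level loop over the rows and no repeated scan of the whole DataFrame, which is what made the replace-in-a-loop version so slow.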
Debugging TypeError: Can only use .dt accessor with datetimelike values:
Indeed, as you pointed out, pd.to_datetime is not converting the times to datetime64s. Instead, it is just returning the same data as strings. By default in this version of pandas, pd.to_datetime returns its input unchanged when it encounters an error while trying to convert the input to datetimes. You can gather a bit more information about what is going wrong by adding the errors='raise' parameter:
pd.to_datetime(reindexed['arrival_time'], errors='raise')
raises
ValueError: hour must be in 0..23
Aha, so the data probably contains times whose hours exceed 23.
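Another way to locate the offending values, sketched here on made-up data, is to let pd.to_datetime coerce unparseable strings to NaT and then look at the rows that had a value but failed to parse. The sample Series and the explicit format are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one time past midnight and one missing value.
x = pd.Series(['23:50:00', '26:09:00', np.nan, '00:10:00'])

# errors='coerce' turns unparseable strings into NaT instead of raising.
parsed = pd.to_datetime(x, format='%H:%M:%S', errors='coerce')

# Rows that had a value but failed to parse are the problematic ones.
bad = x[parsed.isna() & x.notna()]
print(bad)
```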
Using
col = 'arrival_time'
x = reindexed[col]
mask = x.str.extract(r'(\d+):(\d+):(\d+)')[0].astype('float') > 23
we can see examples of rows where the hour is greater than 23:
In [48]: x[mask].head()
Out[48]:
42605 26:09:00
42610 26:12:00
42611 26:20:00
42612 26:30:00
42613 26:35:00
Name: arrival_time, dtype: object
The x.str.extract splits the arrival time strings using the regex pattern r'(\d+):(\d+):(\d+)'. It returns a DataFrame with three columns.
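To see what that DataFrame looks like, here is a small sketch on a made-up Series (the values are assumptions). Rows that do not match the pattern, including missing values, come back as NaN in every column.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: a normal time, a missing value, a time past midnight.
x = pd.Series(['10:47:00', np.nan, '26:09:00'])

# One column per capture group; non-matching rows (including NaN) are NaN.
parts = x.str.extract(r'(\d+):(\d+):(\d+)')
print(parts)

# Convert to float rather than int so that NaN rows survive the cast.
hours = parts[0].astype('float')
print((hours > 23).tolist())
```

Comparing NaN with 23 gives False, so missing rows simply drop out of the mask.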
This piece of debugging code suggests a workaround. Instead of pd.to_datetime, we could use x.str.extract to find the hours and minutes:
import pandas as pd

st = pd.read_csv('csv/stop_times.csv')
sorted_st = st.sort(['trip_id','stop_sequence'], ascending=[False,True])
reindexed = sorted_st.reset_index(drop=True)
for col in ('arrival_time', 'departure_time'):
    df = reindexed[col].str.extract(
        r'(?P<hour>\d+):(?P<minute>\d+):(?P<second>\d+)').astype('float')
    reindexed[col] = df['hour'] * 60 + df['minute']
    reindexed[col] = reindexed[col].interpolate()
print(reindexed.iloc[:5, :3])
yields
   trip_id  arrival_time  departure_time
0  1423492    647.000000      647.000000
1  1423492    649.666667      649.666667
2  1423492    652.333333      652.333333
3  1423492    655.000000      655.000000
4  1423492    655.750000      655.750000
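On recent pandas versions DataFrame.sort no longer exists; sort_values is its replacement. Below is a self-contained sketch of the same workaround written that way. The inline CSV is a made-up stand-in for stop_times.csv, so the values are assumptions, chosen to include hours above 23.

```python
import io
import pandas as pd

# Hypothetical stand-in for stop_times.csv; the values are made up.
csv_text = """trip_id,arrival_time,departure_time,stop_sequence
1,25:50:00,25:50:00,1
1,,,2
1,26:10:00,26:10:00,3
"""
st = pd.read_csv(io.StringIO(csv_text))

# sort_values replaces the removed DataFrame.sort method.
reindexed = (st.sort_values(['trip_id', 'stop_sequence'],
                            ascending=[False, True])
               .reset_index(drop=True))

for col in ('arrival_time', 'departure_time'):
    # Named groups become the column names of the extracted DataFrame.
    parts = reindexed[col].str.extract(
        r'(?P<hour>\d+):(?P<minute>\d+):(?P<second>\d+)').astype('float')
    # Minutes since midnight, then fill the missing stops linearly.
    reindexed[col] = (parts['hour'] * 60 + parts['minute']).interpolate()

print(reindexed)
```

Because the hours are read as plain numbers, 25:50:00 becomes 1550 minutes and nothing needs to be a valid time of day.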