
Optimise Python Code

I have written the following code to preprocess a dataset like this:

StartLocation   StartTime   EndTime
school          Mon Jul 25 19:04:30 GMT+01:00 2016  Mon Jul 25 19:04:33 GMT+01:00 2016
...             ...         ...

It contains a list of locations visited by a user, with start and end times. Each location may occur several times, and there is no comprehensive list of locations. From this, I want to aggregate data for each location (frequency, total time, mean time). To do this I have written the following code:

import re
from datetime import datetime

import pandas as pd

def toEpoch(x):
    # The first format handles offsets like "GMT+01:00": the re.sub strips
    # the last colon so the offset matches %z ("GMT+0100"); the fallback
    # handles strings without a numeric offset.
    try:
        x = datetime.strptime(re.sub(r":(?=[^:]+$)", "", x),
                              '%a %b %d %H:%M:%S %Z%z %Y').strftime('%s')
    except ValueError:
        x = datetime.strptime(x, '%a %b %d %H:%M:%S %Z %Y').strftime('%s')
    # '%s' (seconds since the epoch) is a platform-specific strftime
    # extension; convert to minutes before returning.
    return int(x) / 60
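Since '%s' is not supported on every platform, a portable sketch of the same conversion (my assumption of equivalent behaviour, using `datetime.timestamp()` on the offset-aware parse result; `to_epoch_minutes` is a hypothetical name) would be:

```python
import re
from datetime import datetime

def to_epoch_minutes(s):
    # Strip the last colon from the offset ("GMT+01:00" -> "GMT+0100")
    # so %z can parse it, then use timestamp() instead of the
    # platform-specific '%s' strftime directive.
    dt = datetime.strptime(re.sub(r":(?=[^:]+$)", "", s),
                           '%a %b %d %H:%M:%S %Z%z %Y')
    return int(dt.timestamp()) / 60

to_epoch_minutes("Mon Jul 25 19:04:30 GMT+01:00 2016")  # → 24491164.5
```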

#Preprocess data
df = pd.read_csv('...')
for index, row in df.iterrows():
    df['StartTime'][index] = toEpoch(df['StartTime'][index])
    df['EndTime'][index] = toEpoch(df['EndTime'][index])
    df['TimeTaken'][index] = int(df['EndTime'][index]) - int(df['StartTime'][index])
total = df.groupby(df['StartLocation'].str.lower()).sum()
av = df.groupby(df['StartLocation'].str.lower()).mean()
count = df.groupby(df['StartLocation'].str.lower()).count()
output = pd.DataFrame({"location": total.index, 'total': total['TimeTaken'], 'mean': av['TimeTaken'], 'count': count['TimeTaken']})
print(output)

This code functions correctly, but it is quite inefficient. How can I optimise it?

EDIT: Based on @Batman's helpful comments I no longer iterate over the rows. However, I still hope to optimise this further if possible. The updated code is:

df = pd.read_csv('...')
df['StartTime'] = df['StartTime'].apply(toEpoch)
df['EndTime'] = df['EndTime'].apply(toEpoch)
df['TimeTaken'] = df['EndTime'] - df['StartTime']
total = df.groupby(df['StartLocation'].str.lower()).sum()
av = df.groupby(df['StartLocation'].str.lower()).mean()
count = df.groupby(df['StartLocation'].str.lower()).count()
output = pd.DataFrame({"location": total.index, 'total': total['TimeTaken'], 'mean': av['TimeTaken'], 'count': count['TimeTaken']})
print(output)

First thing I'd do is stop iterating over the rows.

df['StartTime'] = df['StartTime'].apply(toEpoch)
df['EndTime'] = df['EndTime'].apply(toEpoch)
df['TimeTaken'] = df['EndTime'] - df['StartTime']
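As a minimal sketch with made-up epoch-minute values, the vectorised subtraction behaves like this:

```python
import pandas as pd

# Hypothetical frame with times already converted to epoch minutes
df = pd.DataFrame({'StartTime': [100, 200], 'EndTime': [103, 210]})

# One column-wise subtraction replaces the whole per-row loop
df['TimeTaken'] = df['EndTime'] - df['StartTime']
print(df['TimeTaken'].tolist())  # → [3, 10]
```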

Then, do a single groupby operation.

gb = df.groupby('StartLocation')
total = gb.sum()
av = gb.mean()
count = gb.count()
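With some hypothetical sample data, reusing one groupby object for all three aggregates looks like:

```python
import pandas as pd

# Hypothetical stand-in for the preprocessed frame
df = pd.DataFrame({
    'StartLocation': ['school', 'home', 'school'],
    'TimeTaken': [3, 10, 5],
})

# One groupby object, reused for every aggregate
gb = df.groupby('StartLocation')['TimeTaken']
total, av, count = gb.sum(), gb.mean(), gb.count()
print(total['school'], av['school'], count['school'])  # → 8 4.0 2
```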
  • vectorize the date conversion
  • taking the difference of two series of timestamps gives a series of timedeltas
  • use total_seconds to get the seconds from the timedeltas
  • groupby with agg

# convert dates
cols = ['StartTime', 'EndTime']
df[cols] = pd.to_datetime(df[cols].stack()).unstack()

# generate timedelta then total_seconds via the `dt` accessor
df['TimeTaken'] = (df.EndTime - df.StartTime).dt.total_seconds()

# define the lower case version for cleanliness
loc_lower = df.StartLocation.str.lower()

# define `agg` functions for cleanliness
# this tells `groupby` to use 3 functions, sum, mean, and count
# it also tells what column names to use
funcs = dict(Total='sum', Mean='mean', Count='count')
df.groupby(loc_lower).TimeTaken.agg(funcs).reset_index()
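Note that the dict form of `agg` with renaming used above was later deprecated and removed from pandas; on current versions the same result can be sketched with named aggregation (pandas >= 0.25), shown here on hypothetical data:

```python
import pandas as pd

# Hypothetical stand-in for the frame after TimeTaken is computed
df = pd.DataFrame({
    'StartLocation': ['School', 'school', 'home'],
    'TimeTaken': [3.0, 10.0, 5.0],
})

loc_lower = df.StartLocation.str.lower()

# Named aggregation replaces the removed dict-renaming form of agg
out = (df.groupby(loc_lower)['TimeTaken']
         .agg(Total='sum', Mean='mean', Count='count')
         .reset_index())
print(out)
```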



explanation of the date conversion

  • I define cols for convenience
  • df[cols] = is an assignment to those two columns
  • pd.to_datetime() is a vectorized date converter, but it only takes a pd.Series, not a pd.DataFrame
  • df[cols].stack() turns the 2-column dataframe into a single series, now ready for pd.to_datetime()
  • use pd.to_datetime(df[cols].stack()) as described, then unstack() to get back my 2 columns, now ready to be assigned.
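A tiny runnable illustration of the stack/unstack round trip (ISO-formatted strings here, purely as assumed sample data):

```python
import pandas as pd

df = pd.DataFrame({
    'StartTime': ['2016-07-25 19:04:30', '2016-07-25 20:00:00'],
    'EndTime':   ['2016-07-25 19:04:33', '2016-07-25 20:00:10'],
})

cols = ['StartTime', 'EndTime']
# stack() folds both columns into one Series, to_datetime converts it in
# a single vectorized call, and unstack() restores the 2-column shape
df[cols] = pd.to_datetime(df[cols].stack()).unstack()

print((df.EndTime - df.StartTime).dt.total_seconds().tolist())  # → [3.0, 10.0]
```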
