简体   繁体   English

加快时间戳操作

[英]Speeding up timestamp operations

The following transformation (ms -> datetime -> conver timezone) takes a long time to run (4 minutes), probably because I am working with a large dataframe: 以下转换(ms - > datetime - > conver timezone)需要很长时间才能运行(4分钟),可能是因为我正在使用大型数据帧:

for column in ['A', 'B', 'C', 'D', 'E']:
    # Data comes in unix time (ms) so I need to convert it to datetime
    df[column] = pd.to_datetime(df[column], unit='ms')

    # Get times in EST
    df[column] = df[column].apply(lambda x: x.tz_localize('UTC').tz_convert('US/Eastern'))

Is there any way to speed it up? 有没有办法加快速度? Am I already using Pandas data structures and methods in the most efficient manner? 我是否已经以最有效的方式使用Pandas数据结构和方法?

These are available as DatetimeIndex methods which will be much faster: 这些都可以作为DatetimeIndex方法,这将是更快

df[column] = pd.DatetimeIndex(df[column]).tz_localize('UTC').tz_convert('US/Eastern')

Note: In 0.15.0 you'll have access to these as Series dt accessor : 注意:在0.15.0中,您可以访问这些作为Series dt访问器

df[column] = df[column].dt.tz_localize('UTC').tz_convert('US/Eastern')

I would give this a try in Bash using the date command. 我会使用date命令在Bash中尝试一下。 Date proves to be faster than even gawk for routine conversions. 对于例行转换,日期证明比甚至更快。 Python may struggle with this. Python可能会对此感到困惑。

To speed this up even faster export the column in A in one temp file, column B in another, ect. 为了加快这一步,可以更快地将A中的列导出到一个临时文件中,将列B导出到另一个临时文件中,等等。 (You can even do this in python). (你甚至可以在python中这样做)。 Then run the 5 columns in parallel. 然后并行运行5列。

for column in ['A']:
  print>>thefileA, column
for column in ['B']:
  print>>thefileB, column

Then a bash script: 然后是一个bash脚本:

#!/usr/bin/env bash
readarray a < thefileA
for i in $( a ); do
    date -r item: $i
done

You'll want a master bash script to run the first part in python python pythonscript.py . 你需要一个主bash脚本来运行python python pythonscript.py的第一部分。 Then you will want to call in each of the bash scripts in the background from the master ./FILEA.sh & . 然后,您将要从主./FILEA.sh &调用后台中的每个bash脚本。 That will run each column individually and will automatically designate nodes. 这将单独运行每个列并自动指定节点。 For my bash loop after readarray, I'm not 100% that is the correct syntax. 对于readarray之后的bash循环,我不是100%,这是正确的语法。 If you are on a linux, use date -d @ item . 如果您使用的是Linux,请使用date -d @ item

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM