More efficient way to pull time data from a string (or DataFrame object)

I'm learning Python on my own and this is my first question here. Until now I have always been able to find what I needed in existing answers. I finally have something I believe is worth asking. It's a fairly specific task, and I don't even know what to search for.

One of our machines generates a log file that requires a lot of cleaning after it is loaded into a DataFrame and before it can be used. Without going into too much detail, the log file contains time records in a very weird format built from minutes, seconds and milliseconds. I was able to decode it to seconds with the function shown below (and then convert it into a time format with another one). It works fine, but it is a very basic function with a lot of if statements.

My goal is to rewrite it so it looks less amateurish, but the log time format imposes some limitations that are challenging, at least for me. It doesn't help that the units are combinations of the same two letters.

Here are samples of all possible time record combinations:

test1 = 'T#3853m10s575ms'   # 231190.575 [seconds]
test2 = 'T#10s575ms'        # 10.575
test3 = 'T#3853m575ms'      # 231180.575
test4 = 'T#575ms'           # 0.575
test5 = 'T#3853m10s'        # 231190
test6 = 'T#10s'             # 10
test7 = 'T#3853m'           # 231180
test8 = 'T#0ms'             # 0

I've tried to describe the format with the regular expression T#[0-9]*m?[0-9]*s?[0-9]*ms?, but that pattern does not express that there is always at least one digit and at least one unit.
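One possible way to tighten that pattern is sketched below. It is only an illustration under stated assumptions: the names TIME_RE and parse_seconds are made up for this example, and the pattern assumes the units always appear in the order minutes, seconds, milliseconds. Each unit group is optional, a lookahead after T# demands at least one digit, and fullmatch rejects any leftover text, so at least one complete number-plus-unit pair must be present.

```python
import re

# Hypothetical pattern (not from the original post).
# (?=\d)   : at least one digit must follow "T#"
# m(?!s)   : an "m" that is not the start of "ms"
TIME_RE = re.compile(
    r"T#(?=\d)"
    r"(?:(?P<m>\d+)m(?!s))?"
    r"(?:(?P<s>\d+)s)?"
    r"(?:(?P<ms>\d+)ms)?"
)

def parse_seconds(text):
    """Return the record as seconds, or raise ValueError if it is malformed."""
    match = TIME_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"unrecognised time record: {text!r}")
    # Groups that did not take part in the match are None, so fall back to 0.
    minutes = int(match["m"] or 0)
    seconds = int(match["s"] or 0)
    millis = int(match["ms"] or 0)
    return minutes * 60 + seconds + millis / 1000

for sample in ("T#3853m10s575ms", "T#3853m575ms", "T#575ms", "T#3853m"):
    print(sample, parse_seconds(sample))
```

Raising on malformed input is a design choice made for this sketch; the function in the question returns -1 in that situation instead.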

Here is the logic I'm using inside the function: [image: function diagram]

And here is the function I apply to a raw time column in a DataFrame:

def convert_time(string):
    if string == 'T#0ms':
        return 0
    else:
        ms_ = False if string.find('ms') == -1 else True
        string = string[2:-2] if ms_ else string[2:]
        s_ = False if string.find('s') == -1 else True
        m_ = False if string.find('m') == -1 else True
        if m_ and s_ and ms_:
            m, temp = string.split('m')
            s, ms = temp.split('s')
            return int(m)*60 + int(s) + int(ms)*0.001
        elif not m_ and s_ and ms_:
            s, ms = string.split('s')
            return int(s) + 0.001 * int(ms)
        elif m_ and not s_ and ms_:
            m, ms = string.split('m')
            return 60*int(m) + 0.001 * int(ms)
        elif not m_ and not s_ and ms_:
            return int(string) * 0.001
        elif m_ and s_ and not ms_:
            m, s = string.split('m')
            return 60*int(m) + int(s[:-1])
        elif not m_ and s_ and not ms_:
            return int(string[:-1])
        elif m_ and not s_ and not ms_:
            return int(string[:-1]) * 60
        elif not m_ and not s_ and not ms_:
            return -1

As mentioned above, my lack of experience keeps me from writing a better function that produces similar output (or better, e.g. directly in a time format). I hope this is interesting enough to get some hints for improvement. Thanks.
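For the "directly in a time format" part, a minimal sketch using only the standard library might look like the following. The names UNIT_TO_KWARG and record_to_timedelta are hypothetical, and returning None for an unrecognised record is an assumption made for this example, not something the log format prescribes.

```python
import re
from datetime import timedelta

# Maps each unit suffix in the log format to a timedelta keyword argument.
UNIT_TO_KWARG = {"m": "minutes", "s": "seconds", "ms": "milliseconds"}

def record_to_timedelta(record):
    """Convert a record such as 'T#3853m10s575ms' to a datetime.timedelta."""
    # "ms" is listed first in the alternation so it is tried before "m" and "s".
    parts = re.findall(r"(\d+)(ms|m|s)", record)
    if not record.startswith("T#") or not parts:
        return None  # assumed sentinel for records that do not fit the format
    kwargs = {UNIT_TO_KWARG[unit]: int(value) for value, unit in parts}
    return timedelta(**kwargs)

delta = record_to_timedelta("T#3853m10s575ms")
print(delta)                   # 2 days, 16:13:10.575000
print(delta.total_seconds())   # 231190.575
```

Because timedelta stores whole days, seconds and microseconds, the millisecond part is kept exactly rather than going through float arithmetic. On a DataFrame column, the function could be applied row by row in the same way as convert_time.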

def str_to_sec(time_str):
    return_int = 0
    cur_int = 0

    # remove start characters and replace 'ms' with a single character as unit
    time_str = time_str.replace('T#','').replace('ms', 'p')

    # build multiplier matrix
    split_order = ['m', 's', 'p']
    multiplier = [60, 1, 0.001]
    calc_multiplier_dic = dict(zip(split_order, multiplier))

    # loop through string and update the cumulative time
    for ch in time_str:
        if ch.isnumeric():
            cur_int = cur_int * 10 + int(ch)
            continue
        if ch.isalpha():
            return_int += cur_int * calc_multiplier_dic[ch]
            cur_int = 0

    return return_int

Using regex:

import re

def f(x):
    x = x[2:]  # drop the leading 'T#'
    time = re.findall(r'\d+', x)            # numeric values, in order
    timeType = re.findall(r'[a-zA-Z]+', x)  # unit suffix for each value
    total = 0
    for i, j in zip(time, timeType):
        if j == 'm':
            total += 60 * float(i)
        elif j == 's':
            total += float(i)
        elif j == 'ms':
            total += float(i) / 1000
    return total

test1 = 'T#3853m10s575ms'   # 231190.575 [seconds]
test2 = 'T#10s575ms'        # 10.575
test3 = 'T#3853m575ms'      # 231180.575
test4 = 'T#575ms'           # 0.575
test5 = 'T#3853m10s'        # 231190
test6 = 'T#10s'             # 10
test7 = 'T#3853m'           # 231180
test8 = 'T#0ms'             # 0

arr = [test1,test2,test3,test4,test5,test6,test7,test8]

for t in arr:
    print(f(t))

Output:

231190.575
10.575
231180.575
0.575
231190.0
10.0
231180.0
0.0
[Finished in 0.7s]

If you have more unit types, such as hours or days, you can make the code shorter by using map:

import re
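def symbol(j):
    # Seconds per unit; an unknown unit falls through and returns None
    if j == 'm':
        return 60
    elif j == 's':
        return 1
    elif j == 'ms':
        return .001

def f(x):
    x = x[2:]  # drop the leading 'T#'
    time = list(map(float, re.findall(r'\d+', x)))
    timeType = list(map(symbol, re.findall(r'[a-zA-Z]+', x)))
    # multiply each value by its unit's seconds and add everything up
    return sum([a * b for a, b in zip(timeType, time)])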

test1 = 'T#3853m10s575ms'   # 231190.575 [seconds]
test2 = 'T#10s575ms'        # 10.575
test3 = 'T#3853m575ms'      # 231180.575
test4 = 'T#575ms'           # 0.575
test5 = 'T#3853m10s'        # 231190
test6 = 'T#10s'             # 10
test7 = 'T#3853m'           # 231180
test8 = 'T#0ms'             # 0

arr = [test1,test2,test3,test4,test5,test6,test7,test8]

for t in arr:
    print(f(t))
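The same idea can also be written with a dictionary lookup in place of the symbol function, which keeps each additional unit to a single table entry. This is only a sketch: SECONDS_PER_UNIT and to_seconds are names invented here, and the 'h' and 'd' suffixes are hypothetical, since the log format in the question only shows m, s and ms.

```python
import re

# Seconds per unit. 'h' and 'd' are hypothetical extra suffixes that do not
# appear in the sample records; they only show how the table would grow.
SECONDS_PER_UNIT = {"d": 86400, "h": 3600, "m": 60, "s": 1, "ms": 0.001}

def to_seconds(record):
    """Sum every (number, unit) pair found after the leading 'T#'."""
    pairs = re.findall(r"(\d+)([a-zA-Z]+)", record[2:])
    # An unknown unit raises KeyError here instead of being skipped silently.
    return sum(int(value) * SECONDS_PER_UNIT[unit] for value, unit in pairs)

for sample in ("T#3853m10s575ms", "T#10s", "T#3853m"):
    print(sample, to_seconds(sample))
```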
