Cleanest way to parse in Python

Question

I have a few log lines in the format "TimeA:0.216/1,TimeB:495.761/1,TimeC:2.048/2,TimeD:0.296/1" (the syntax is timerName:time/instances). This is how I parse one:

from collections import namedtuple

ServiceTimer = namedtuple("ServiceTimer", ["timerName", "time", "instances"])
timers = []
for entry in line.split(","):
    name, rest = entry.split(":")
    time, instances = rest.split("/")
    timers.append(ServiceTimer(name, float(time), int(instances)))

Is there a better way? There will be millions of log lines, so it also needs to be fast. Any pointers would be great.

Answer 1

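I tested three versions:

Your original code, without the namedtuple.
A regexp example with type conversion.
Another regexp version with a few speed tricks.

The results surprised me a little. They show that "string".split is really fast, faster than the example regexp processing. To make the regexp faster, you have to use a memory-mapped file and forget about line-by-line processing.

Here is the source, in temp.py: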

def process1():
    results = []
    with open('temp.txt') as fptr:
        for line in fptr:
            for entry in line.split(','):
                name, rest = entry.split(":")
                time, instances = rest.split("/")
                results.append((name, float(time), int(instances)))
    return len(results)

def process2():
    from re import finditer
    results = []
    with open('temp.txt') as fptr:
        for line in fptr:
            for match in finditer(r'([^,:]*):([^/]*)/([^,]*)', line):
                results.append(
                    (match.group(1), float(match.group(2)), int(match.group(3))))
    return len(results)

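def process3():
    from re import finditer
    import mmap
    results = []
    # Map the whole file into memory and scan it in one pass, not line by line.
    # On Python 3 an mmap exposes bytes, so the file is opened in binary mode
    # and the pattern is a bytes pattern.
    with open('temp.txt', 'rb') as fptr:
        fmap = mmap.mmap(fptr.fileno(), 0, access=mmap.ACCESS_READ)
        # Excluding \r and \n from the name keeps line breaks out of group 1.
        for match in finditer(rb'([^,:\r\n]*):([^/]*)/([^,\r\n]*)', fmap):
            results.append(
                (match.group(1).decode(), float(match.group(2)), int(match.group(3))))
    return len(results)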

I tested these functions on a text file, "temp.txt", containing 1 million copies of your example line. Here are the results:

In [8]: %time temp.process1()
CPU times: user 10.24 s, sys: 0.00 s, total: 10.24 s
Wall time: 10.24 s
Out[8]: 4000000

In [9]: %time temp.process2()
CPU times: user 12.63 s, sys: 0.00 s, total: 12.63 s
Wall time: 12.63 s
Out[9]: 4000000

In [10]: %time temp.process3()
CPU times: user 9.43 s, sys: 0.00 s, total: 9.43 s
Wall time: 9.43 s
Out[10]: 4000000

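So the regexp version that drops line-by-line file processing in favor of memory mapping is roughly 8% faster than the example code, while the example regexp code is 23% slower than the example code.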

Moral of the story: always benchmark.
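
A minimal sketch of how a comparison like this could be repeated on a single in-memory line with the standard timeit module. The names parse_split and parse_regex and the repeat count are assumptions made for illustration, not part of the test above, and absolute timings will differ by machine and Python version.

```python
import re
import timeit

LINE = "TimeA:0.216/1,TimeB:495.761/1,TimeC:2.048/2,TimeD:0.296/1"
PATTERN = re.compile(r'([^,:]*):([^/]*)/([^,]*)')

def parse_split(line):
    # Same approach as process1: split on ',', then ':', then '/'.
    results = []
    for entry in line.split(','):
        name, rest = entry.split(':')
        time, instances = rest.split('/')
        results.append((name, float(time), int(instances)))
    return results

def parse_regex(line):
    # Same approach as process2, with the pattern compiled once up front.
    return [(m.group(1), float(m.group(2)), int(m.group(3)))
            for m in PATTERN.finditer(line)]

# Both parsers must agree before their timings are worth comparing.
assert parse_split(LINE) == parse_regex(LINE)

if __name__ == "__main__":
    for func in (parse_split, parse_regex):
        seconds = timeit.timeit(lambda: func(LINE), number=100_000)
        print(f"{func.__name__}: {seconds:.3f} s")
```

Timing a single line in memory leaves out file I/O, so it only isolates the parsing cost; the file-based functions above measure both.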

Answer 2

Per @zaftcoAgeiha's suggestion, use a regular expression:

from re import finditer
line = "TimeA:0.216/1,TimeB:495.761/1,TimeC:2.048/2,TimeD:0.296/1"
[ m.groups( ) for m in finditer( r'([^,:]*):([^/]*)/([^,]*)', line ) ]

You will get:

[('TimeA', '0.216', '1'),
 ('TimeB', '495.761', '1'),
 ('TimeC', '2.048', '2'),
 ('TimeD', '0.296', '1')]

For type conversion, you can use the group method:

[ ( m.group(1), float( m.group(2) ) , int( m.group(3) ))
    for m in finditer( r'([^,:]*):([^/]*)/([^,]*)', line ) ]

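Edit: To parse the whole file, you should compile the pattern first and use a list comprehension instead of append:

from re import compile

regex = compile( r'([^,:]*):([^/]*)/([^,]*)' )
with open( 'fname.txt', 'r' ) as fin:
    # The outer loop (lines) must come before the inner loop (matches).
    results = [ ( m.group(1), float( m.group(2) ) , int( m.group(3) ))
        for line in fin for m in regex.finditer( line ) ]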
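
The order of the clauses in a nested comprehension matters: the loop over lines has to come first, followed by the loop over the matches in each line. Here is a small sketch that applies the same compiled pattern to any iterable of lines, so it can be tried without a file. The name parse_lines and the sample data are assumptions made for illustration.

```python
import re

REGEX = re.compile(r'([^,:]*):([^/]*)/([^,]*)')

def parse_lines(lines):
    # Outer loop over lines first, inner loop over matches second.
    return [(m.group(1), float(m.group(2)), int(m.group(3)))
            for line in lines
            for m in REGEX.finditer(line)]

# Two sample lines with trailing newlines, as they would come from a file.
sample = [
    "TimeA:0.216/1,TimeB:495.761/1\n",
    "TimeC:2.048/2,TimeD:0.296/1\n",
]
parsed = parse_lines(sample)
print(parsed)
```

An open file object can be passed in place of the list, since iterating over it yields one line at a time.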

Answer 3

Maybe with fewer lines:

for entry in line.split(','):
    split_line = entry.split(":")[1].split('/')
    timers.append(ServiceTimer(entry.split(':')[0], float(split_line[0]), int(split_line[1])))
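
For comparison, a sketch of the same split-based idea wrapped in a function that splits each entry on ':' only once. The function name parse_line is an assumption made for illustration; ServiceTimer follows the definition in the question.

```python
from collections import namedtuple

ServiceTimer = namedtuple("ServiceTimer", ["timerName", "time", "instances"])

def parse_line(line):
    timers = []
    # strip() drops a trailing newline if the line came straight from a file.
    for entry in line.strip().split(','):
        # Split each entry once into its name and the "time/instances" part.
        name, rest = entry.split(':')
        time, instances = rest.split('/')
        timers.append(ServiceTimer(name, float(time), int(instances)))
    return timers

timers = parse_line("TimeA:0.216/1,TimeB:495.761/1,TimeC:2.048/2,TimeD:0.296/1")
print(timers[2])
```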

3 solutions

Solution 1
2 2013-12-04 04:15:37

Solution 2
1 2013-12-04 03:18:24

Solution 3
0 2013-12-04 02:46:09
