
What is the best way to create a python dictionary from a string?

I have a large file with about 1.9E8 lines that I need to parse.

During each iteration I will create a temporary dictionary to send to another method, which will give me the output I want.

Since the file is too large, I can't load it all at once with the readlines() method.

So my last resort for making it faster is the parsing itself.
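For context, the surrounding loop looks roughly like this; iterating over the file object reads it lazily, one line at a time, so nothing has to be loaded up front (process_data and big_file.txt are placeholders, and optionA is defined below):

def process_data(data):
    pass  # stands in for the other method that consumes each dictionary

# Iterating over the file object yields one line at a time, so the
# whole 1.9E8-line file never has to fit in memory (unlike readlines()).
with open("big_file.txt") as f:  # placeholder path
    for line in f:
        data = optionA(line.rstrip("\n"))  # optionA as defined below
        process_data(data)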

I already have two options for generating the dictionary. optionB performs better than optionA, and I am aware I could try a regex, but I am not familiar with regular expressions. I would welcome insights into better alternatives, if there are any.

Expected input: "A@1:100;2:240;...:.." (the input may be longer, with more groups and their frequencies)

def optionA(line):
    _id, info = line.split("@")        # drop the leading identifier
    data = {}
    for g_info in info.split(";"):     # each group looks like "key:value"
        k, v = g_info.split(":")
        data[k] = v
    return data

def optionB(line):
    _id, info = line.split("@")
    # split every "key:value" group and let dict() pair them up
    return dict(map(lambda i: i.split(":"), info.split(";")))

Expected output: {'1': '100', '2': '240'}

I am open to any recommendations!
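As a quick sanity check, both helpers give the expected mapping on the sample line:

line = "A@1:100;2:240"
print(optionA(line))  # {'1': '100', '2': '240'}
print(optionB(line))  # {'1': '100', '2': '240'}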

Quick example of a regex to parse the line:

>>> import re
>>> line = 'A@1:100;2:240'
>>> data = re.search(r'@(\d+):(\d+);(\d+):(\d+)',line).groups()
>>> D = {data[0]:data[1],data[2]:data[3]}
>>> D
{'1': '100', '2': '240'}

Here are some timings, with the following saved as x.py:

import re
regex = re.compile(r'@(\d+):(\d+);(\d+):(\d+)')

def optionA(line):
    _id, info = line.split("@")
    data = {}
    for g_info in info.split(";"):
        k, v = g_info.split(":")
        data[k] = v
    return data

def optionB(line):
    _id, info = line.split("@")
    return dict(map(lambda i: i.split(":"), info.split(";")))

def optionC(line):
    data = regex.search(line).groups()
    return {data[0]:data[1],data[2]:data[3]}

line = 'A@1:100;2:240'

Times:

C:\>py -m timeit -s "import x" "x.optionA(x.line)"
100000 loops, best of 3: 3.01 usec per loop

C:\>py -m timeit -s "import x" "x.optionB(x.line)"
100000 loops, best of 3: 5.15 usec per loop

C:\>py -m timeit -s "import x" "x.optionC(x.line)"
100000 loops, best of 3: 2.88 usec per loop

Edit: With the slight change in requirements (more than two pairs per line), I tried findall for optionC and a slightly different version of optionA (optionAA):

import re
regex = re.compile(r'(\d+):(\d+)')

def optionA(line):
    _id, info = line.split("@")
    data = {}
    for g_info in info.split(";"):
        k, v = g_info.split(":")
        data[k] = v
    return data

def optionAA(line):
    data = {}
    # skip the fixed two-character "A@" prefix instead of splitting on "@"
    for g_info in line[2:].split(";"):
        k, v = g_info.split(":")
        data[k] = v
    return data

def optionB(line):
    _id, info = line.split("@")
    return dict(map(lambda i: i.split(":"), info.split(";")))

def optionC(line):
    return dict(regex.findall(line))

line = 'A@1:100;2:240;3:250;4:260;5:100;6:100;7:100;8:100;9:100;10:100'

Timings:

C:\>py -m timeit -s "import x" "x.optionA(x.line)"
100000 loops, best of 3: 8.35 usec per loop

C:\>py -m timeit -s "import x" "x.optionAA(x.line)"
100000 loops, best of 3: 8.17 usec per loop

C:\>py -m timeit -s "import x" "x.optionB(x.line)"
100000 loops, best of 3: 12.3 usec per loop

C:\>py -m timeit -s "import x" "x.optionC(x.line)"
100000 loops, best of 3: 12.8 usec per loop

So it looks like the modified optionAA wins with this particular line. Hopefully this shows the importance of measuring algorithms. I'm surprised findall was slower.
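If you are not on Windows, roughly the same measurement can be reproduced in-script with the timeit module (a portable sketch of the py -m timeit commands above, assuming the code is saved as x.py):

import timeit
from x import optionA, optionAA, optionB, optionC, line  # the x.py above

# Time 100000 calls of each option against the same test line,
# mirroring the command-line runs.
for fn in (optionA, optionAA, optionB, optionC):
    seconds = timeit.timeit(lambda: fn(line), number=100000)
    print(fn.__name__, seconds)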

Here's a simple example of using a compiled regexp to match your pattern.

import re

s = "A@1:100;2:240"
# raw string avoids invalid-escape warnings for \d
compiledre = re.compile(r"A@(\d+):(\d+);(\d+):(\d+)$")
res = compiledre.search(s)
if res:
    print(dict([(res.group(1), res.group(2)), (res.group(3), res.group(4))]))

Output is:

{'1': '100', '2': '240'}
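Note that this pattern assumes exactly two key:value pairs per line. If a line can carry more, a findall-based variant (the same idea as optionC above) handles any number of pairs:

import re

s = "A@1:100;2:240;3:250"
pairs = re.compile(r"(\d+):(\d+)")
print(dict(pairs.findall(s)))  # {'1': '100', '2': '240', '3': '250'}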
