Parsing a file into a dictionary in Python

I have a file; a small fragment of it is shown below:

Clutch001
Albino X Pastel
Bumble Bee X Albino Lesser
Clutch002
Bee X Fire Bee
Albino Cinnamon X Albino
Mojave X Bumble Bee
Clutch003
Black Pastel X Banana Ghost Lesser
....

The number of lines between one ClutchXXX and the next ClutchXXX may vary, but it is never zero. Is it possible to use a specific string from the file as a key (in my case ClutchXXX) and the text up to the next occurrence of that string as the value in a dictionary? I want to end up with a dictionary like this:

d={'Clutch001': 'Albino X Pastel, Bumble Bee X Albino Lesser',
   'Clutch002': 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee',
   'Clutch003': 'Black Pastel X Banana Ghost Lesser'}

I am mostly interested in the part where we match a string pattern and save it as a key, with the text after it as the value. Any suggestions or pointers to a useful approach would be appreciated.

import re
from functools import partial
from itertools import groupby
from pprint import pprint

key = partial(re.match, r'Clutch\d\d\d')

with open('foo.txt') as f:
    groups = (', '.join(map(str.strip, g)) for k, g in groupby(f, key=key))
    pprint(dict(zip(*[iter(groups)]*2)))

{'Clutch001': 'Albino X Pastel, Bumble Bee X Albino Lesser',
 'Clutch002': 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee',
 'Clutch003': 'Black Pastel X Banana Ghost Lesser'}
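
The `zip(*[iter(groups)]*2)` idiom above can look opaque. It passes the same iterator to `zip` twice, so each tuple consumes two consecutive items. A minimal sketch with a made-up list (the names `items`, `it` and `pairs` are only for illustration):

```python
# Made-up flat sequence that alternates key, value, key, value, ...
items = ['Clutch001', 'a, b', 'Clutch002', 'c']

it = iter(items)
# Both arguments to zip are the *same* iterator, so each tuple
# takes two consecutive items: (key, value).
pairs = list(zip(it, it))
result = dict(pairs)
print(result)
```

This only pairs up correctly when every key is followed by exactly one value, which the `groupby` step guarantees as long as every Clutch line is followed by at least one other line.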

Collect the lines in lists, storing that list in a dictionary at the same time:

d = {}
values = None
with open(filename) as inputfile:
    for line in inputfile:
        line = line.strip()
        if line.startswith('Clutch'):
            values = d[line] = []
        else:
            values.append(line)

This gives you:

{'Clutch001': ['Albino X Pastel', 'Bumble Bee X Albino Lesser'],
 'Clutch002': ['Bee X Fire Bee', 'Albino Cinnamon X Albino', 'Mojave X Bumble Bee'],
 'Clutch003': ['Black Pastel X Banana Ghost Lesser']}

It's easy enough to turn all those lists into single strings though, after loading the file:

d = {key: ', '.join(value) for key, value in d.items()}

You can also do the joining as you read the file; I'd use a generator function to process the file in groups:

def per_clutch(inputfile):
    clutch = None
    lines = []
    for line in inputfile:
        line = line.strip()
        if line.startswith('Clutch'):
            if lines:
                yield clutch, lines
            clutch, lines = line, []
        else:
            lines.append(line)
    if clutch and lines:
        yield clutch, lines

then just slurp all groups into a dictionary:

with open(filename) as inputfile:
    d = {clutch: ', '.join(lines) for clutch, lines in per_clutch(inputfile)}

Demo of the latter:

>>> def per_clutch(inputfile):
...     clutch = None
...     lines = []
...     for line in inputfile:
...         line = line.strip()
...         if line.startswith('Clutch'):
...             if lines:
...                 yield clutch, lines
...             clutch, lines = line, []
...         else:
...             lines.append(line)
...     if clutch and lines:
...         yield clutch, lines
... 
>>> sample = '''\
... Clutch001
... Albino X Pastel
... Bumble Bee X Albino Lesser
... Clutch002
... Bee X Fire Bee
... Albino Cinnamon X Albino
... Mojave X Bumble Bee
... Clutch003
... Black Pastel X Banana Ghost Lesser
... '''.splitlines(True)
>>> {clutch: ', '.join(lines) for clutch, lines in per_clutch(sample)}
{'Clutch001': 'Albino X Pastel, Bumble Bee X Albino Lesser', 'Clutch002': 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee', 'Clutch003': 'Black Pastel X Banana Ghost Lesser'}
>>> from pprint import pprint
>>> pprint(_)
{'Clutch001': 'Albino X Pastel, Bumble Bee X Albino Lesser',
 'Clutch002': 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee',
 'Clutch003': 'Black Pastel X Banana Ghost Lesser'}

As noted in comments, if "Clutch" (or whatever keyword) can be relied on not to appear in the non-keyword lines, you could use the following:

keyword = "Clutch"
with open(filename) as inputfile:
    t = inputfile.read()
    # [1:] skips the text before the first keyword (an empty string when the file starts with it)
    d = {keyword + s[:3]: s[3:].strip().replace('\n', ', ') for s in t.split(keyword)[1:]}

This reads the whole file into memory at once, so it should be avoided if your file may get very large.
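
One detail worth checking with this approach: when the text begins with the keyword, `str.split` returns an empty first piece, which has to be skipped or it turns into a bogus 'Clutch' key. A small sketch on made-up sample text (the `text` and `pieces` names are assumptions for illustration):

```python
keyword = "Clutch"
# Made-up sample text standing in for the file contents.
text = "Clutch001\nAlbino X Pastel\nClutch002\nBee X Fire Bee\n"

pieces = text.split(keyword)
# The text starts with the keyword, so the first piece is the empty string.
print(pieces)

# Skip the piece before the first keyword, then rebuild key and value.
d = {keyword + s[:3]: s[3:].strip().replace('\n', ', ')
     for s in pieces[1:]}
print(d)
```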

You could use re.split() to enumerate "Clutch" parts in the file:

import re

with open(filename) as inputfile:
    tokens = iter(re.split(r'(^Clutch\d{3}\s*$)\s+', inputfile.read(), flags=re.M))
next(tokens) # skip everything before the first Clutch
print({k: ', '.join(v.splitlines()) for k, v in zip(tokens, tokens)})

Output

{'Clutch001': 'Albino X Pastel, Bumble Bee X Albino Lesser', 
 'Clutch002': 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee',
 'Clutch003': 'Black Pastel X Banana Ghost Lesser'}
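
This works because the pattern contains a capturing group: `re.split()` then keeps the matched header text in the result list instead of discarding it. A short sketch on made-up sample text (the `text` and `parts` names are only for illustration):

```python
import re

# Made-up sample text standing in for the file contents.
text = ("Clutch001\nAlbino X Pastel\nBumble Bee X Albino Lesser\n"
        "Clutch002\nBee X Fire Bee\n")

# The parenthesised part of the pattern is returned as its own list item,
# so headers and bodies alternate after the leading empty string.
parts = re.split(r'(^Clutch\d{3}\s*$)\s+', text, flags=re.M)
print(parts)
```

The leading empty string is whatever precedes the first header, which is why the first token is skipped before pairing.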

Suppose the file 'file.txt' contains:

Clutch001
Albino X Pastel
Bumble Bee X Albino Lesser
Clutch002
Bee X Fire Bee
Albino Cinnamon X Albino
Mojave X Bumble Bee
Clutch003
Black Pastel X Banana Ghost Lesser

To get your dictionary, try this:

import re

with open('file.txt', 'r') as f:
    # the capturing group keeps each 'ClutchNNN' header in the split result
    result = re.split(
        r'(Clutch\d{3})',
        f.read()
    )[1:] # result is ['Clutch001', '\nAlbino X Pastel\nBumble Bee X Albino Lesser\n', 'Clutch002', '\nBee X Fire Bee\nAlbino Cinnamon X Albino\nMojave X Bumble Bee\n', 'Clutch003', '\nBlack Pastel X Banana Ghost Lesser\n']

    keys = result[::2] # keys is ['Clutch001', 'Clutch002', 'Clutch003']
    values = result[1::2] # values is ['\nAlbino X Pastel\nBumble Bee X Albino Lesser\n', '\nBee X Fire Bee\nAlbino Cinnamon X Albino\nMojave X Bumble Bee\n', '\nBlack Pastel X Banana Ghost Lesser\n']

    values = [
        value.strip().replace('\n', ', ')
        for value in values
    ] # values is ['Albino X Pastel, Bumble Bee X Albino Lesser', 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee', 'Black Pastel X Banana Ghost Lesser']

    d = dict(zip(keys, values)) # d is {'Clutch001': 'Albino X Pastel, Bumble Bee X Albino Lesser', 'Clutch002': 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee', 'Clutch003': 'Black Pastel X Banana Ghost Lesser'}

Here's a version that works, more or less. I'm not sure how Pythonic it is (it can probably be squeezed and can definitely be improved):

import re
import fileinput

d = dict()
key = ''
rx = re.compile(r'^Clutch\d\d\d$')

for line in fileinput.input():
    line = line[0:-1]
    if rx.match(line):
        key = line
        d[key] = ''
    else:
        d[key] += line

print(d)

for key in d:
    print(key, d[key])

The output (which repeats the information) is:

{'Clutch001': 'Albino X PastelBumble Bee X Albino Lesser', 'Clutch002': 'Bee X Fire BeeAlbino Cinnamon X AlbinoMojave X Bumble Bee', 'Clutch003': 'Black Pastel X Banana Ghost Lesser'}
Clutch001 Albino X PastelBumble Bee X Albino Lesser
Clutch002 Bee X Fire BeeAlbino Cinnamon X AlbinoMojave X Bumble Bee
Clutch003 Black Pastel X Banana Ghost Lesser

If for some reason the first line isn't a 'clutch' line, you get a KeyError, because the initial empty key is not in the dictionary.
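
If you would rather fail with a clearer message in that case, one option is to track whether a header has been seen yet. The sketch below is only an illustration: the `parse_clutches` helper and its error message are hypothetical, and it concatenates lines without a separator, just like the version above.

```python
import re

def parse_clutches(lines):
    """Hypothetical helper: map each 'ClutchNNN' header to its concatenated lines."""
    rx = re.compile(r'^Clutch\d\d\d$')
    d = {}
    key = None  # no header seen yet
    for line in lines:
        line = line.rstrip('\r\n')
        if rx.match(line):
            key = line
            d[key] = ''
        elif key is None:
            # A data line arrived before any header: report it explicitly.
            raise ValueError('data line before the first Clutch header: %r' % line)
        else:
            d[key] += line
    return d

print(parse_clutches(['Clutch001\n', 'Albino X Pastel\n']))
```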

A revised version that joins with commas and copes with broken text files (no newline at the end), etc.:

import fileinput

d = {}

for line in fileinput.input():
    line = line.rstrip('\r\n') # line.strip() for leading and trailing space
    if line.startswith('Clutch'):
        key = line
        d[key] = ''
        pad = ''
    else:
        d[key] += pad + line
        pad = ', '

print(d)

for key in d:
    print("'%s': '%s'" % (key, d[key]))

The 'pad' technique is one I like in other contexts, and it works fine here. I'm tolerably certain it wouldn't be regarded as Pythonic, though.
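
For comparison, here is the 'pad' technique in isolation next to `str.join`, which builds the same string in one call. The `words` list is made-up sample data:

```python
words = ['Albino X Pastel', 'Bumble Bee X Albino Lesser']

# 'pad' technique: the separator starts empty and becomes ', '
# after the first item, so no leading or trailing comma is produced.
joined = ''
pad = ''
for word in words:
    joined += pad + word
    pad = ', '

# str.join produces the same string in one call.
same = joined == ', '.join(words)
print(joined, same)
```

The 'pad' form is handy when the pieces arrive one at a time and cannot easily be collected into a list first.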

Revised sample output:

{'Clutch001': 'Albino X Pastel, Bumble Bee X Albino Lesser', 'Clutch002': 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee', 'Clutch003': 'Black Pastel X Banana Ghost Lesser'}
'Clutch001': 'Albino X Pastel, Bumble Bee X Albino Lesser'
'Clutch002': 'Bee X Fire Bee, Albino Cinnamon X Albino, Mojave X Bumble Bee'
'Clutch003': 'Black Pastel X Banana Ghost Lesser'

Assuming each Clutch header occurs on its own line, the following will work:

import re
d = {}
with open(filename) as f:
    for line in f:
        if re.match("^Clutch[0-9]+", line):
            match = line.strip()   # the matched line is the key; drop its newline
            d[match] = ''
        else:
            # every line without the word 'Clutch' is appended to the
            # current key's value, with its newline replaced by a space
            d[match] += line.replace('\n', ' ')
