Cascaded string split, pythonic way

Take for example this format from IANA: http://www.iana.org/assignments/language-subtag-registry

%%
Type: language
Subtag: aa
Description: Afar
Added: 2005-10-16
%%
Type: language
Subtag: ab
Description: Abkhazian
Added: 2005-10-16
Suppress-Script: Cyrl
%%
Type: language
Subtag: ae
Description: Avestan
Added: 2005-10-16
%%

Say I open the file:

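from urllib.request import urlopen  # Python 3; in Python 2 this was urllib.urlopen
f = urlopen("http://www.iana.org/assignments/language-subtag-registry")
all = f.read().decode("utf-8")  # read() returns bytes in Python 3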

Normally you would do it like this:

lan=all.split("%%") 

then iterate over lan and split("\n"), then iterate over each result and split(":"). Is there a way to do this in Python in one batch, without the explicit iteration, so that the output still looks like this: [[["Type","language"],["Subtag", "ae"],...]...] ?

I don't see any sense in trying to do this in a single pass if the elements you get after each split are semantically different.

You could start by splitting on ":", which would get you to the fine-grained data, but what good would that be if you did not know where each piece of data belongs?

That said, you could put all the levels of separation inside a generator and have it yield dictionary objects with your data, ready for consumption:

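def iana_parse(data):
    for record in data.split("%%\n"):
        # skip empty records at file endings:
        if not record.strip():
            continue
        rec_data = {}
        for line in record.split("\n"):
            # skip the blank line left after each record's final newline:
            if not line.strip():
                continue
            # split at the first colon only, since values may contain ":"
            key, value = line.split(":", 1)
            rec_data[key.strip()] = value.strip()
        yield rec_data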
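
A usage sketch, assuming the generator is fed a shortened copy of the sample from the question rather than the downloaded file. The function is repeated here, splitting on "%%" and at the first colon only, just so the snippet runs on its own; `sample`, `records` and `subtags` are names made up for this illustration.

```python
def iana_parse(data):
    """Yield one dict per '%%'-separated record."""
    for record in data.split("%%"):
        if not record.strip():
            continue  # skip the empty chunks around the separators
        rec_data = {}
        for line in record.splitlines():
            if not line.strip():
                continue  # skip blank lines inside a chunk
            key, value = line.split(":", 1)
            rec_data[key.strip()] = value.strip()
        yield rec_data


sample = """%%
Type: language
Subtag: aa
Description: Afar
Added: 2005-10-16
%%
Type: language
Subtag: ab
Description: Abkhazian
Added: 2005-10-16
Suppress-Script: Cyrl
%%
"""

records = list(iana_parse(sample))
subtags = [rec["Subtag"] for rec in records]
print(subtags)  # ['aa', 'ab']
```

Because each record is a dict, the calling code can look fields up by name and never needs to know how the text was split.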

It can be done as a one-liner, as you request in the comments, but as I commented back, a single expression squeezed onto one line takes longer to write than the example above and is nearly impossible to maintain. The example above unfolds the logic over a few lines of code that live "out of the way", i.e. not inline where you are dealing with the actual data, which keeps both the parsing and the code that uses the data readable and maintainable.

That said, parsing as a structure of nested lists as you want can be done thus:

structure = [[[token.strip() for token in line.split(":", 1)] for line in record.split("\n") if line.strip()] for record in data.split("%%") if record.strip()]
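
To see the shape this produces, here is a sketch of the same comprehension spread over several lines and run on a shortened sample (two fields per record, trimmed for this illustration). The `if line.strip()` filter and the `split(":", 1)` limit are included so that the blank strings around each %% do not show up as [''] entries and values containing a colon stay in one piece.

```python
data = """%%
Type: language
Subtag: aa
%%
Type: language
Subtag: ab
%%
"""

structure = [
    [
        [token.strip() for token in line.split(":", 1)]
        for line in record.splitlines()
        if line.strip()  # drop the blank lines next to each '%%'
    ]
    for record in data.split("%%")
    if record.strip()  # drop the empty chunks at the start and end
]

print(structure)
# [[['Type', 'language'], ['Subtag', 'aa']], [['Type', 'language'], ['Subtag', 'ab']]]
```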

As a single comprehension:

raw = """\
%%
Type: language
Subtag: aa
Description: Afar
Added: 2005-10-16
%%
Type: language
Subtag: ab
Description: Abkhazian
Added: 2005-10-16
Suppress-Script: Cyrl
%%
Type: language
Subtag: ae
Description: Avestan
Added: 2005-10-16
%%"""


data = [
     dict(
         row.split(': ', 1)  # split at the first ': ' only, so values may contain it
         for row in item_str.split("\n")
         if row  # required to avoid the empty lines which contained '%%'
     )
     for item_str in raw.split("%%") 
     if item_str  # required to avoid the empty items at the start and end
]
>>> data[0]['Added']
'2005-10-16'

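Regexes, but I don't see the point:

import re
re.split(r'%%|:|\n', string)

Here multiple patterns are chained using the alternation operator |. Note that the result is a single flat list of tokens, so the grouping into records and lines is lost.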
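
A small sketch of what that call returns, run on a shortened sample string invented for this illustration (`text` and `tokens` are not names from the answer). Empty strings left between adjacent separators are filtered out and whitespace is stripped.

```python
import re

text = "%%\nType: language\nSubtag: aa\n%%\nType: language\nSubtag: ab\n%%\n"

# Split on any of the three separators, then drop the empty pieces.
tokens = [t.strip() for t in re.split(r"%%|:|\n", text) if t.strip()]

print(tokens)
# ['Type', 'language', 'Subtag', 'aa', 'Type', 'language', 'Subtag', 'ab']
```

Everything ends up at one level: there is no way to tell from `tokens` alone where one record stops and the next begins, which is why this does not give the nested output asked for in the question.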

You can use itertools.groupby:

ss = """%%
Type: language
Subtag: aa
Description: Afar
Added: 2005-10-16
%%
Type: language
Subtag: ab
Description: Abkhazian
Added: 2005-10-16
Suppress-Script: Cyrl
%%
Type: language
Subtag: ae
Description: Avestan
Added: 2005-10-16
"""
sss = ss.splitlines(True)  # a list of lines, like what you get when iterating over a file object


import itertools

output = []
for k, v in itertools.groupby(sss, lambda x: x.strip() == '%%'):
    if k:  # Hit a '%%' separator line: start a new record.
        print("\nNew group:\n")
        current = {}
        output.append(current)
    else:  # Regular lines: write the data to the current record dict.
        for line in v:
            print(line.strip())
            # Split at the first colon only; strip() removes the
            # surrounding whitespace and the trailing newline.
            key, value = line.split(':', 1)
            current[key.strip()] = value.strip()

One benefit of this approach is that it doesn't require reading the entire file into memory: itertools.groupby consumes its input lazily, so an open file object can be passed in place of the list of lines.
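
The laziness can be made explicit by wrapping the groupby loop in a generator and handing it a file-like object. This is a minimal sketch, not the answer's own code: `iter_records` and `fake_file` are names invented here, and `io.StringIO` stands in for the real opened file.

```python
import io
import itertools


def iter_records(lines):
    """Lazily yield one dict per record from an iterable of lines."""
    groups = itertools.groupby(lines, lambda line: line.strip() == "%%")
    for is_separator, group in groups:
        if is_separator:
            continue  # the '%%' lines only mark record boundaries
        record = {}
        for line in group:
            if not line.strip():
                continue
            key, value = line.split(":", 1)
            record[key.strip()] = value.strip()
        if record:
            yield record


# StringIO behaves like an open text file: iterating it yields lines.
fake_file = io.StringIO(
    "%%\nType: language\nSubtag: aa\n%%\nType: language\nSubtag: ab\n%%\n"
)

records = iter_records(fake_file)
first = next(records)  # only the lines needed for the first record are read
print(first)  # {'Type': 'language', 'Subtag': 'aa'}
```

Each call to `next` reads just far enough to finish one record, so the same function should work on a large file without loading it whole.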
