Cascading string split, the Pythonic way

Question

Take this format from IANA as an example: http://www.iana.org/assignments/language-subtag-registry

%%
Type: language
Subtag: aa
Description: Afar
Added: 2005-10-16
%%
Type: language
Subtag: ab
Description: Abkhazian
Added: 2005-10-16
Suppress-Script: Cyrl
%%
Type: language
Subtag: ae
Description: Avestan
Added: 2005-10-16
%%

Say I open the file:

from urllib.request import urlopen  # Python 3 location of urlopen
f = urlopen("http://www.iana.org/assignments/language-subtag-registry")
all = f.read().decode("utf-8")  # read() returns bytes in Python 3

Normally you would do this:

lan=all.split("%%")

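and then iterate over lan and split("\n"), then iterate over the results and split(":"). Is there a way to do this in Python in one batch, without iterating, so that the output still looks like this: [[["Type","language"],["Subtag", "ae"],...]...]?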
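
For reference, the nested loops described above might be spelled out roughly like this. It is only a sketch: `sample` is a shortened stand-in for the downloaded text and `result` is an illustrative name, neither is part of the original question.

```python
# Shortened stand-in for the registry text (assumption: same record layout).
sample = "%%\nType: language\nSubtag: aa\n%%\nType: language\nSubtag: ab\n%%\n"

result = []
for record in sample.split("%%"):
    if not record.strip():
        continue  # skip the empty pieces around the first and last "%%"
    fields = []
    for line in record.split("\n"):
        if not line:
            continue  # skip empty strings left by leading/trailing newlines
        fields.append(line.split(": "))
    result.append(fields)

print(result)
```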

Answer 1

If the elements you get after each split are semantically different, there is no point in trying to do this in a single step.

You could split on ":" first, which would give you the fine-grained data, but what good is that if you no longer know what each piece belongs to?
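
A tiny illustration of that problem, using two lines of the sample record (`text` and `pieces` are illustrative names): splitting on ":" first glues the end of one field to the key of the next.

```python
text = "Type: language\nSubtag: aa"

# Splitting on ":" alone ignores the line boundaries.
pieces = text.split(":")
print(pieces)  # ['Type', ' language\nSubtag', ' aa']
```

The middle element mixes the value of one field with the name of the next, so the record structure has to be recovered some other way.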

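That said, you can put all the levels of splitting inside a generator and have it yield dictionary objects with your data, ready to be consumed: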

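def iana_parse(data):
    for record in data.split("%%\n"):
        # skip empty records at file endings:
        if not record.strip():
            continue
        rec_data = {}
        for line in record.splitlines():
            # skip blank lines and a final "%%" with no newline after it:
            if ":" not in line:
                continue
            # split on the first colon only, so values may contain ":"
            key, value = line.split(":", 1)
            rec_data[key.strip()] = value.strip()
        yield rec_data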

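It can be done as a one-liner, as you ask in the comments, but as I commented back, that only means it can be written to fit in a single expression. It takes longer to write than the example above and is nearly impossible to maintain. The code in the example above keeps the logic in a few lines that sit "out of the way" (that is, not inline at the place where you work with your actual data), which gives both tasks readability and maintainability.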
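
To illustrate, a call site that uses such a generator might look like the sketch below. The function is repeated, with blank lines skipped and the split limited to the first colon, so the snippet runs on its own; `sample` and `by_subtag` are illustrative names, and indexing by "Subtag" is just one assumed use.

```python
def iana_parse(data):
    """Yield one dict per %%-separated record."""
    for record in data.split("%%\n"):
        if not record.strip():
            continue  # empty piece before the first or after the last "%%"
        rec_data = {}
        for line in record.splitlines():
            if ":" not in line:
                continue  # blank line or stray separator
            key, value = line.split(":", 1)
            rec_data[key.strip()] = value.strip()
        yield rec_data


# Shortened stand-in for the registry text.
sample = (
    "%%\n"
    "Type: language\nSubtag: aa\nDescription: Afar\nAdded: 2005-10-16\n"
    "%%\n"
    "Type: language\nSubtag: ab\nDescription: Abkhazian\nAdded: 2005-10-16\n"
    "Suppress-Script: Cyrl\n"
    "%%\n"
)

# The parsing details stay inside iana_parse; the call site is one line.
by_subtag = {rec["Subtag"]: rec for rec in iana_parse(sample)}
print(by_subtag["ab"]["Description"])  # Abkhazian
```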

That said, it can be parsed into the nested-list structure you want:

structure = [[[token.strip() for token in line.split(":", 1)] for line in record.split("\n") if line.strip()] for record in data.split("%%") if record.strip()]

Answer 2

As a single comprehension:

raw = """\
%%
Type: language
Subtag: aa
Description: Afar
Added: 2005-10-16
%%
Type: language
Subtag: ab
Description: Abkhazian
Added: 2005-10-16
Suppress-Script: Cyrl
%%
Type: language
Subtag: ae
Description: Avestan
Added: 2005-10-16
%%"""


data = [
    dict(
        row.split(': ', 1)  # split once, so values may contain ': '
        for row in item_str.split("\n")
        if row  # required to avoid the empty lines which contained '%%'
    )
    for item_str in raw.split("%%")
    if item_str  # required to avoid the empty items at the start and end
]

>>> data[0]['Added']
'2005-10-16'

Answer 3

A regular expression, though I don't see the point:

re.split('%%|:|\\n', string)

Here, multiple patterns are chained with the | (or) operator.
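
As a rough illustration of why this is of limited use here, the sketch below applies the same kind of pattern to a shortened sample (`text` and `flat` are illustrative names). The result is one flat list, padded with empty strings, with no record or key/value grouping.

```python
import re

text = "%%\nType: language\nSubtag: aa\n%%\n"

# Every separator is treated alike, so the nesting is lost.
flat = re.split('%%|:|\n', text)
print(flat)
# ['', '', 'Type', ' language', 'Subtag', ' aa', '', '', '']
```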

Answer 4

You can use itertools.groupby:

ss = """%%
Type: language
Subtag: aa
Description: Afar
Added: 2005-10-16
%%
Type: language
Subtag: ab
Description: Abkhazian
Added: 2005-10-16
Suppress-Script: Cyrl
%%
Type: language
Subtag: ae
Description: Avestan
Added: 2005-10-16
"""
sss = ss.splitlines(True)  # list that looks like what iterating over a file object yields


import itertools

output = []
for k, v in itertools.groupby(sss, lambda x: x.strip() == '%%'):
    if k:  # hit a '%%' separator; need a new group
        print("\nNew group:\n")
        current = {}
        output.append(current)
    else:  # just regular lines; write the data to our current record dict
        for line in v:
            print(line.strip())
            # split on the first colon so the key does not keep its ':'
            key, value = line.split(":", 1)
            current[key.strip()] = value.strip()
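One benefit of this answer is that it doesn't require you to read the whole file: if you iterate over a file object instead of sss, the whole expression is evaluated lazily.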
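
A minimal sketch of that lazy variant, wrapped in a generator. Here `iter_records` is a hypothetical helper name, and `io.StringIO` stands in for an open file so the snippet runs without network or disk access.

```python
import io
import itertools


def iter_records(lines):
    """Lazily yield one dict per record from any iterable of lines."""
    groups = itertools.groupby(lines, lambda line: line.strip() == "%%")
    for is_separator, group in groups:
        if is_separator:
            continue  # the '%%' lines only mark record boundaries
        record = {}
        for line in group:
            key, sep, value = line.partition(":")
            if sep:  # ignore lines without a colon
                record[key.strip()] = value.strip()
        yield record


# Stand-in for an open file; any iterable of lines works.
fake_file = io.StringIO(
    "%%\nType: language\nSubtag: aa\n%%\nType: language\nSubtag: ab\n"
)
records = iter_records(fake_file)
first = next(records)  # only the lines up to the second '%%' are read here
print(first)  # {'Type': 'language', 'Subtag': 'aa'}
```

Because both `groupby` and the generator pull lines on demand, the rest of the input is only read when more records are requested.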

Answer 1
3 2012-09-17 13:50:36

Answer 2
3 accepted 2013-09-10 10:17:22

Answer 3
2 2012-09-17 13:55:15

Answer 4
2 2012-09-17 13:59:48
