
How do I group lines by a specific word in a string in Python

I have a multi-line string in Python that looks like this:

"""1234 dog list some words
1432 cat line 2
1789 cat line3
1348 dog line 4
1678 dog line 5
1733 fish line 6
1093 cat more words"""

I want to be able to group the lines by animal in Python, so that my output looks like this:

dog
1234 dog list some words 
1348 dog line 4
1678 dog line 5

cat
1432 cat line 2 
1789 cat line3 
1093 cat more words

fish
1733 fish line 6

So far, I know I need to split the text into lines:

def parser(txt):
    for line in txt.splitlines():
        print(line)

But I'm not sure how to proceed. How do I group each line with its animal?

You can use a defaultdict and split each line:

from collections import defaultdict

txt = """123 dog foo
456 cat bar
1234 dog list some words
1348 dog line 4
1432 cat line 2 
1789 cat line3 
1093 cat more words
1678 dog line 5
"""


def parser(txt):
    result = defaultdict(list)
    for line in txt.splitlines():
        num, animal, _ = line.split(' ', 2)  # split on the first 2 blanks, leave the rest unsplit
        result[animal].append(line)  # store the whole line under its animal
    return result

result = parser(txt)
for animal, lines in result.items():
    print('>>> %s' % animal)
    for line in lines:
        print(line)
    print("")

Output:

>>> dog
123 dog foo
1234 dog list some words
1348 dog line 4
1678 dog line 5

>>> cat
456 cat bar
1432 cat line 2 
1789 cat line3 
1093 cat more words
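
A slightly more defensive variant of the parser above, offered as a sketch rather than a replacement. It skips blank lines and lines that have no animal field, which the three-name unpacking of `line.split(' ', 2)` would reject with a `ValueError`. The function name `group_by_animal` and the sample text are assumptions made for illustration.

```python
from collections import defaultdict


def group_by_animal(txt):
    """Group lines by their second whitespace-separated field."""
    result = defaultdict(list)
    for line in txt.splitlines():
        # split on any run of whitespace, at most twice
        parts = line.split(None, 2)
        if len(parts) < 2:
            continue  # blank line, or a line without an animal field
        result[parts[1]].append(line.strip())
    return dict(result)


# hypothetical sample with a blank line and a record that has no trailing text
sample = """1234 dog list some words
1432 cat line 2

1789 cat
1348 dog line 4
"""

grouped = group_by_animal(sample)
for animal, lines in grouped.items():
    print(animal)
    for line in lines:
        print(line)
    print()
```

Passing `None` as the separator makes `split` treat any run of whitespace as one delimiter, so doubled spaces or tabs between fields do not shift the animal into the wrong position.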

str1 = """1234 dog list some words 1432 cat line 2 1789 cat line3 1348 dog line 4 1678 dog line 5 1733 fish line 6 1093 cat more words"""

animals = ["dog", "cat", "fish"]
tmp = {}    # animal -> list of records (each record is a list of words)
tmp1 = []   # words of the record currently being built
currentAnimal = ""
listOfWords = str1.split(" ")
for index, word in enumerate(listOfWords):
    if word in animals:
        if currentAnimal:
            # tmp1 ends with the number that belongs to the new record; drop it
            tmp1.pop()
            # store the finished record under the previous animal
            tmp.setdefault(currentAnimal, []).append(tmp1)
        currentAnimal = word
        # start the new record with its number and the animal name
        tmp1 = [listOfWords[index - 1], word]
    else:
        tmp1.append(word)

# store the last record, which is not followed by another animal
if currentAnimal:
    tmp.setdefault(currentAnimal, []).append(tmp1)

for eachKey in tmp:
    print(eachKey)
    for eachItem in tmp[eachKey]:
        if len(eachItem) > 0:
            print(" ".join(eachItem))

OUTPUT:

dog
1234 dog list some words
1348 dog line 4
1678 dog line 5
cat
1432 cat line 2
1789 cat line3
1093 cat more words
fish
1733 fish line 6

I know there are other answers, but I like mine better (hahaha).

Anyway, I parsed the original string as if it had no \n (newline) characters.

To get the animals and the sentences, I used a regular expression:

import re

# original string with no new line characters
txt = """1234 dog list some words 1432 cat line 2 1789 cat line3 1348 dog line 4 1678 dog line 5 1733 fish line 6 1093 cat more words"""

# use findall to capture the groups
groups = re.findall(r"(?=(\d{4} (\w+) .*?(?=\d{4}|$)))", txt)

At this point, groups holds a list of tuples:

>>> groups
[('1234 dog list some words ', 'dog'),
 ('1432 cat line 2 ', 'cat'),
 ('1789 cat line3 ', 'cat'),
 ('1348 dog line 4 ', 'dog'),
 ('1678 dog line 5 ', 'dog'),
 ('1733 fish line 6 ', 'fish'),
 ('1093 cat more words', 'cat')]
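
To make the pattern easier to read, here is a sketch of an equivalent expression written with `re.VERBOSE` and named groups. It drops the outer lookahead, which is not needed when records do not overlap. Like the original pattern, it assumes that every record starts with a four-digit number and that no four-digit number appears inside a record's text. The names `RECORD`, `line`, and `animal` are illustrative, not part of the answer.

```python
import re

txt = "1234 dog list some words 1432 cat line 2 1789 cat line3 1348 dog line 4 1678 dog line 5 1733 fish line 6 1093 cat more words"

RECORD = re.compile(
    r"""
    (?P<line>               # the whole record
        \d{4}[ ]            # four-digit number followed by a space
        (?P<animal>\w+)[ ]  # animal name followed by a space
        .*?                 # rest of the record, as short as possible
    )
    (?=\d{4}|$)             # stop right before the next number, or at the end
    """,
    re.VERBOSE,
)

# build the same (sentence, animal) tuples that findall produced
pairs = [(m.group("line"), m.group("animal")) for m in RECORD.finditer(txt)]
for line, animal in pairs:
    print(repr(line), animal)
```

In verbose mode a bare space in the pattern is ignored, which is why the literal spaces are written as `[ ]`.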

Then I wanted to group all the sentences that mention the same animal. That is why I created a hash table (known as a dictionary in Python):

# create a dictionary to store the formatted data
dct = {}
for group in groups:
    if group[1] in dct:
        dct[group[1]].append(group[0])
    else:
        dct[group[1]] = [group[0]]

The dct dictionary looks like this:

>>> dct
{'dog': ['1234 dog list some words ', '1348 dog line 4 ', '1678 dog line 5 '],
 'cat': ['1432 cat line 2 ', '1789 cat line3 ', '1093 cat more words'],
 'fish': ['1733 fish line 6 ']}
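
The if/else in the grouping loop can also be written with `dict.setdefault`, which may read a little more compactly. This is only a sketch: the `pairs` list is a shortened, hand-written sample of the tuples, the name `by_animal` is an assumption, and the `rstrip()` call (not in the original) removes the trailing space that the regular expression leaves on each sentence.

```python
# shortened, hand-written sample of (sentence, animal) tuples
pairs = [
    ("1234 dog list some words ", "dog"),
    ("1432 cat line 2 ", "cat"),
    ("1348 dog line 4 ", "dog"),
    ("1733 fish line 6 ", "fish"),
]

by_animal = {}
for sentence, animal in pairs:
    # setdefault returns the existing list, or stores and returns a new one
    by_animal.setdefault(animal, []).append(sentence.rstrip())

print(by_animal)
```

Since Python 3.7 a dictionary keeps insertion order, so the animals come out in the order they first appear in the text.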

Finally, we just need to print it in the format you want:

# then print the result in the format you like
for key, value in dct.items():
    print(key)
    for sentence in value:
        print(sentence)
    print()

The output is:

dog
1234 dog list some words 
1348 dog line 4 
1678 dog line 5 

cat
1432 cat line 2 
1789 cat line3 
1093 cat more words

fish
1733 fish line 6 

The final code is as follows:

import re

# original string with no new line characters
txt = """1234 dog list some words 1432 cat line 2 1789 cat line3 1348 dog line 4 1678 dog line 5 1733 fish line 6 1093 cat more words"""

# use findall to capture the groups
groups = re.findall(r"(?=(\d{4} (\w+) .*?(?=\d{4}|$)))", txt)

# create a dictionary to store the formatted data
dct = {}
for group in groups:
    if group[1] in dct:
        dct[group[1]].append(group[0])
    else:
        dct[group[1]] = [group[0]]

# then print the result in the format you like
for key, value in dct.items():
    print(key)
    for sentence in value:
        print(sentence)
    print()
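
If the input may or may not contain newlines, one option is to collapse all whitespace before applying the pattern. The sketch below wraps that idea in a function. The name `group_records` is hypothetical, the pattern is the one above without the outer lookahead, and it still assumes that each record has some text after the animal and that no four-digit number appears inside that text.

```python
import re


def group_records(txt):
    """Group records by animal, whether or not txt contains newlines."""
    # collapse newlines and repeated spaces into single spaces
    flat = " ".join(txt.split())
    dct = {}
    for sentence, animal in re.findall(r"(\d{4} (\w+) .*?)(?=\d{4}|$)", flat):
        dct.setdefault(animal, []).append(sentence.strip())
    return dct


multi_line = """1234 dog list some words
1432 cat line 2
1733 fish line 6
1093 cat more words"""

single_line = "1234 dog list some words 1432 cat line 2 1733 fish line 6 1093 cat more words"

# both forms of the input give the same grouping
print(group_records(multi_line) == group_records(single_line))
for animal, sentences in group_records(multi_line).items():
    print(animal)
    for sentence in sentences:
        print(sentence)
    print()
```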

