How do I group lines by a specific word in a string in Python?
I have a multi-line string in Python that looks like this:
"""1234 dog list some words
1432 cat line 2
1789 cat line3
1348 dog line 4
1678 dog line 5
1733 fish line 6
1093 cat more words"""
I want to group the lines by animal, so that my output looks like this:
dog
1234 dog list some words
1348 dog line 4
1678 dog line 5
cat
1432 cat line 2
1789 cat line3
1093 cat more words
fish
1733 fish line 6
So far, I know that I need to split the text into lines:
def parser(txt):
    for line in txt.splitlines():
        print(line)
But I'm not sure how to continue. How do I group each line with its animal?
You can use a defaultdict and split each line:
from collections import defaultdict
txt = """123 dog foo
456 cat bar
1234 dog list some words
1348 dog line 4
1432 cat line 2
1789 cat line3
1093 cat more words
1678 dog line 5
"""
def parser(txt):
    result = defaultdict(list)
    for line in txt.splitlines():
        # Split on the first two spaces only; the rest of the line stays in one piece.
        num, animal, _ = line.split(' ', 2)
        # Store the whole line under its animal.
        result[animal].append(line)
    return result

result = parser(txt)
for animal, lines in result.items():
    print('>>> %s' % animal)
    for line in lines:
        print(line)
    print("")
Output:
>>> dog
123 dog foo
1234 dog list some words
1348 dog line 4
1678 dog line 5
>>> cat
456 cat bar
1432 cat line 2
1789 cat line3
1093 cat more words
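Note that the parser above unpacks exactly three fields, so a blank line or a line such as `789 dog` would raise a ValueError. A minimal sketch of a more forgiving variant follows. It assumes that lines with fewer than two fields should be skipped and that lines with nothing after the animal should be kept; the name `group_by_animal` is hypothetical.

```python
from collections import defaultdict

def group_by_animal(txt):
    """Group lines by their second whitespace-separated field."""
    result = defaultdict(list)
    for line in txt.splitlines():
        # Split on any run of whitespace, at most twice.
        fields = line.split(None, 2)
        if len(fields) < 2:
            continue  # assumption: blank or one-word lines are ignored
        result[fields[1]].append(line.strip())
    return dict(result)

sample = "123 dog foo\n\n456 cat bar\n789 dog\n"
grouped = group_by_animal(sample)
print(grouped)
```

Returning a plain dict at the end avoids accidentally creating empty entries when the result is later indexed with an unknown animal.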
str1 = """1234 dog list some words 1432 cat line 2 1789 cat line3 1348 dog line 4 1678 dog line 5 1733 fish line 6 1093 cat more words"""
animals = ["dog", "cat", "fish"]
tmp = {}             # animal -> list of records, each record a list of words
tmp1 = []            # words of the record currently being built
currentAnimal = ""
listOfWords = str1.split(" ")
for index, word in enumerate(listOfWords):
    if word in animals:
        # The word just before an animal is the number that starts a new
        # record, so drop it from the previous record before storing that one.
        if tmp1:
            tmp1.pop()
        if tmp1:
            # Store the finished record under the animal it belongs to,
            # which is the previous animal, not the one just found.
            tmp.setdefault(currentAnimal, []).append(tmp1)
        currentAnimal = word
        tmp1 = [listOfWords[index - 1], word]
    else:
        tmp1.append(word)
# Store the last record; no later animal triggers it.
if tmp1 and currentAnimal:
    tmp.setdefault(currentAnimal, []).append(tmp1)
for eachKey in tmp:
    print(eachKey)
    for eachItem in tmp[eachKey]:
        print(" ".join(eachItem))
OUTPUT:
dog
1234 dog list some words
1348 dog line 4
1678 dog line 5
cat
1432 cat line 2
1789 cat line3
1093 cat more words
fish
1733 fish line 6
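I know there are other answers, but I like mine better (haha).
Anyway, I parsed the original string as if it had no \n (newline) characters.
To get the animals and the sentences, I used a regular expression: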
import re
# original string with no new line characters
txt = """1234 dog list some words 1432 cat line 2 1789 cat line3 1348 dog line 4 1678 dog line 5 1733 fish line 6 1093 cat more words"""
# use findall to capture the groups
groups = re.findall(r"(?=(\d{4} (\w+) .*?(?=\d{4}|$)))", txt)
At this point, groups holds a list of tuples:
>>> groups
[('1234 dog list some words ', 'dog'),
('1432 cat line 2 ', 'cat'),
('1789 cat line3 ', 'cat'),
('1348 dog line 4 ', 'dog'),
('1678 dog line 5 ', 'dog'),
('1733 fish line 6 ', 'fish'),
('1093 cat more words', 'cat')]
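The pattern above wraps everything in a lookahead, so each match is zero-width and `findall` returns only the two captured groups. As an illustration, here is a sketch of the same idea without the outer lookahead, using `re.finditer` with named groups. The group names `record` and `animal` are my own labels, and the sketch makes the same assumption as the original pattern: every record starts with a four-digit number and no other four-digit run appears in the text.

```python
import re

txt = "1234 dog list some words 1432 cat line 2 1789 cat line3"

# A record is: four digits, a space, the animal, a space, then as little
# text as possible up to the next four-digit number or the end of the string.
pattern = re.compile(r"(?P<record>\d{4} (?P<animal>\w+) .*?)(?=\d{4}|$)")

pairs = [
    (m.group("record").strip(), m.group("animal"))
    for m in pattern.finditer(txt)
]
print(pairs)
```

Stripping each record here removes the trailing space that the lazy `.*?` leaves before the next number.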
Then I wanted to group all the sentences that mention the same animal. That is why I created a data structure called a hash table (a.k.a. a dictionary, in Python):
# create a dictionary to store the formatted data
dct = {}
for group in groups:
    if group[1] in dct:
        dct[group[1]].append(group[0])
    else:
        dct[group[1]] = [group[0]]
dct
The dictionary looks like this:
>>> dct
{'dog': ['1234 dog list some words ', '1348 dog line 4 ', '1678 dog line 5 '],
'cat': ['1432 cat line 2 ', '1789 cat line3 ', '1093 cat more words'],
'fish': ['1733 fish line 6 ']}
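The if/else used to build the dictionary can be collapsed with `dict.setdefault`, which returns the existing list for a key or inserts a new empty one. A small sketch, using a shortened `groups` list for illustration and stripping the trailing space from each sentence (the stripping is my addition, not part of the answer):

```python
groups = [
    ("1234 dog list some words ", "dog"),
    ("1432 cat line 2 ", "cat"),
    ("1348 dog line 4 ", "dog"),
]

dct = {}
for sentence, animal in groups:
    # setdefault creates the list the first time an animal is seen.
    dct.setdefault(animal, []).append(sentence.strip())
print(dct)
```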
Finally, we just need to print it in the format you want:
# then print the result in the format you like
for key, value in dct.items():
    print(key)
    for sentence in value:
        print(sentence)
    print()
The output is:
dog
1234 dog list some words
1348 dog line 4
1678 dog line 5
cat
1432 cat line 2
1789 cat line3
1093 cat more words
fish
1733 fish line 6
The final code is as follows:
import re
# original string with no new line characters
txt = """1234 dog list some words 1432 cat line 2 1789 cat line3 1348 dog line 4 1678 dog line 5 1733 fish line 6 1093 cat more words"""
# use findall to capture the groups
groups = re.findall(r"(?=(\d{4} (\w+) .*?(?=\d{4}|$)))", txt)
# create a dictionary to store the formatted data
dct = {}
for group in groups:
    if group[1] in dct:
        dct[group[1]].append(group[0])
    else:
        dct[group[1]] = [group[0]]
# then print the result in the format you like
for key, value in dct.items():
    print(key)
    for sentence in value:
        print(sentence)
    print()
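If the text really does arrive as one flat string, another option is to restore the line breaks first and then reuse any line-based grouping. A minimal sketch, assuming (as the regex above does) that a four-digit number only ever appears at the start of a record; the variable names are illustrative:

```python
import re

flat = "1234 dog list some words 1432 cat line 2 1733 fish line 6"

# Split at every space that is immediately followed by a standalone
# four-digit number; the lookahead keeps the number in the next piece.
lines = re.split(r" (?=\d{4}\b)", flat)
print(lines)
```

Each element of `lines` can then be handled exactly like a line produced by `splitlines()`.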