Split by regex of new line and capital letter
I have been struggling to split my string with a regular expression in Python.
I have a text file that I load, in the following format:
"Peter went to the gym; \nhe worked out for two hours \nKyle ate lunch
at Kate's house. Kyle went home at 9. \nSome other sentence
here\n\u2022Here's a bulleted line"
I would like to get the following output:
['Peter went to the gym; he worked out for two hours','Kyle ate lunch
at Kate's house. Kyle went home at 9.', 'Some other sentence here',
'\u2022Here's a bulleted line']
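I want to split my string in Python on a newline followed by either a capital letter or a bullet point.
I have tried to tackle the first half of the problem, splitting my string only on a newline followed by a capital letter.
Here is what I have so far: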
print re.findall(r'\n[A-Z][a-z]+',str,re.M)
This only gives me:
[u'\nKyle', u'\nSome']
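That is only the first word of each line. I have tried variations of that regex, but I don't know how to get the rest of the line.
I assume that to also split on the bullet point, I would just add an OR alternative in the same form as the expression that splits on the capital letter. Is this the best way?
I hope this makes sense, and I'm sorry if my question is unclear. :)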
You can split at a \n that is followed by a capital letter or the bullet character:
import re
s = """
Peter went to the gym; \nhe worked out for two hours \nKyle ate lunch
at Kate's house. Kyle went home at 9. \nSome other sentence
here\n\u2022Here's a bulleted line
"""
# list() keeps this working on Python 3, where filter() returns an iterator
new_list = list(filter(None, re.split('\n(?=•)|\n(?=[A-Z])', s)))
Output:
['Peter went to the gym; \nhe worked out for two hours ', "Kyle ate lunch \nat Kate's house. Kyle went home at 9. ", 'Some other sentence \nhere', "•Here's a bulleted line\n"]
Alternatively, using the \u2022 escape instead of the literal bullet character:
new_list = list(filter(None, re.split('\n(?=\u2022)|\n(?=[A-Z])', s)))
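As a side note, the two lookahead branches can probably be folded into a single character class. The sketch below assumes Python 3 and uses made-up variable names (text, two_branches, one_class, pieces); it only checks that both spellings split the sample string the same way.

```python
import re

# Sample text from the question, written as one Python 3 string.
text = (
    "Peter went to the gym; \nhe worked out for two hours \n"
    "Kyle ate lunch at Kate's house. Kyle went home at 9. \n"
    "Some other sentence here\n\u2022Here's a bulleted line"
)

# Two alternatives, each with its own lookahead (the form used above).
two_branches = re.split("\n(?=\u2022)|\n(?=[A-Z])", text)

# One lookahead with a character class that covers both cases.
one_class = re.split("\n(?=[\u2022A-Z])", text)

# Drop empty strings, which show up if the text starts with a matching newline.
pieces = [p for p in one_class if p]

print(two_branches == one_class)
print(pieces)
```

Because the lookahead does not consume the capital letter or the bullet, each piece keeps its first character; only the newline itself is removed.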
You can use split like this:
>>> str = u"Peter went to the gym; \nhe worked out for two hours \nKyle ate lunch at Kate's house. Kyle went home at 9. \nSome other sentence here\n\u2022Here's a bulleted line"
>>> print re.split(u'\n(?=\u2022|[A-Z])', str)
[u'Peter went to the gym; \nhe worked out for two hours ',
u"Kyle ate lunch at Kate's house. Kyle went home at 9. ",
u'Some other sentence here',
u"\u2022Here's a bulleted line"]