
Split by regex of new line and capital letter

I've been struggling to split my string with a regular expression in Python.

I have a text file that I load, in this format:

"Peter went to the gym; \nhe worked out for two hours \nKyle ate lunch 
 at Kate's house. Kyle went home at 9. \nSome other sentence 
 here\n\u2022Here's a bulleted line"

I want to get the following output:

['Peter went to the gym; he worked out for two hours','Kyle ate lunch 
at Kate's house. He went home at 9.', 'Some other sentence here', 
'\u2022Here's a bulleted line']

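I want to split my string in Python on a newline followed by either a capital letter or a bullet point.

I've tried tackling the first half of the problem, splitting my string only on a newline followed by a capital letter.

Here's what I have so far: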

print re.findall(r'\n[A-Z][a-z]+',str,re.M)

This only gives me:

[u'\nKyle', u'\nSome']

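which is just the first word of each line. I've tried variations of that regex, but I don't know how to get the rest of the line.

I assume that to also split on the bullet point, I would just add an OR alternative in the same format as the capital-letter pattern. Is that the best way?

I hope this makes sense, and I'm sorry if my question is unclear. :)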

You can split at each \n that is followed by a capital letter or the bullet character:

import re
s = """
Peter went to the gym; \nhe worked out for two hours \nKyle ate lunch 
at Kate's house. Kyle went home at 9. \nSome other sentence 
here\n\u2022Here's a bulleted line
"""
# filter(None, ...) drops empty strings; list() materializes the result in Python 3
new_list = list(filter(None, re.split('\n(?=•)|\n(?=[A-Z])', s)))

Output:

['Peter went to the gym; \nhe worked out for two hours ', "Kyle ate lunch \nat Kate's house. Kyle went home at 9. ", 'Some other sentence \nhere', "•Here's a bulleted line\n"]

Alternatively, without typing the literal bullet symbol:

new_list = list(filter(None, re.split('\n(?=\u2022)|\n(?=[A-Z])', s)))
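
The split above keeps the inner line breaks (for example 'gym; \nhe worked'), while the output requested in the question has them joined into single lines. The following is a minimal sketch of one way to get there, assuming Python 3; the names text, parts and cleaned are illustrative and not from the original answer. It merges the two lookaheads into one character class and then collapses leftover whitespace:

```python
import re

# Sample input from the question, written as one Python 3 string.
text = (
    "Peter went to the gym; \nhe worked out for two hours \nKyle ate lunch "
    "at Kate's house. Kyle went home at 9. \nSome other sentence "
    "here\n\u2022Here's a bulleted line"
)

# Split at each newline that is followed by a capital letter or a bullet.
# The lookahead (?=...) checks the next character without consuming it.
parts = re.split(r'\n(?=[A-Z\u2022])', text)

# Collapse remaining whitespace runs (including inner newlines) to one
# space, and skip chunks that are empty or whitespace-only.
cleaned = [' '.join(part.split()) for part in parts if part.strip()]

print(cleaned)
```

The newline before the lowercase 'he' is not a split point, so that line stays attached to the previous sentence and is only joined by the whitespace cleanup.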

You can use this split call:

>>> str = u"Peter went to the gym; \nhe worked out for two hours \nKyle ate lunch at Kate's house. Kyle went home at 9. \nSome other sentence here\n\u2022Here's a bulleted line"
>>> print re.split(u'\n(?=\u2022|[A-Z])', str)

[u'Peter went to the gym; \nhe worked out for two hours ',
 u"Kyle ate lunch at Kate's house. Kyle went home at 9. ",
 u'Some other sentence here',
 u"\u2022Here's a bulleted line"]

Code Demo
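
Both answers rely on a lookahead rather than matching the capital letter directly. As a small illustration of why that matters (a sketch assuming Python 3, with a made-up sample string), compare a pattern that consumes the letter with one that only looks ahead:

```python
import re

# Hypothetical sample, chosen only to show the difference.
sample = "first line\nSecond line\nthird"

# Consuming pattern: the capital letter is part of the delimiter,
# so it is removed from the result.
consumed = re.split(r'\n[A-Z]', sample)

# Lookahead pattern: only the newline is the delimiter; the capital
# letter is checked but stays in the next chunk.
kept = re.split(r'\n(?=[A-Z])', sample)

print(consumed)  # ['first line', 'econd line\nthird']
print(kept)      # ['first line', 'Second line\nthird']
```

This is also why the findall attempt in the question returns only fragments: it reports what the pattern matched, whereas re.split returns the text between the matches.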
