Split by regex on a newline and a capital letter

Question

I've been struggling to split my string with a regular expression in Python.

I load a text file whose content is in this format:

"Peter went to the gym; \nhe worked out for two hours \nKyle ate lunch 
 at Kate's house. Kyle went home at 9. \nSome other sentence 
 here\n\u2022Here's a bulleted line"

I'd like to get the following output:

['Peter went to the gym; he worked out for two hours','Kyle ate lunch 
at Kate's house. He went home at 9.', 'Some other sentence here', 
'\u2022Here's a bulleted line']

I'm looking to split my string at a newline followed by a capital letter or a bullet point in Python.

I've tried tackling the first half of the problem: splitting my string at just a newline followed by a capital letter.

Here's what I have so far:

print re.findall(r'\n[A-Z][a-z]+',str,re.M)

This just gives me:

[u'\nKyle', u'\nSome']

which is just the first word of each match. I've tried variations of that regex, but I don't know how to get the rest of the line.

I assume that to also split at the bullet point, I would just add an OR alternative to the regex, in the same format as the part that splits on a capital letter. Is this the best way?

I hope this makes sense, and I'm sorry if my question is in any way unclear. :)

Answer 1

You can split at a \n that is followed by a capital letter or the bullet character. The lookahead (?=...) checks the next character without consuming it, so it stays in the result:

import re
s = """
Peter went to the gym; \nhe worked out for two hours \nKyle ate lunch 
at Kate's house. Kyle went home at 9. \nSome other sentence 
here\n\u2022Here's a bulleted line
"""
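# filter(None, ...) drops the empty string produced by the leading newline in s;
# list() is needed in Python 3, where filter returns an iterator.
new_list = list(filter(None, re.split('\n(?=•)|\n(?=[A-Z])', s)))

Output: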

['Peter went to the gym; \nhe worked out for two hours ', "Kyle ate lunch \nat Kate's house. Kyle went home at 9. ", 'Some other sentence \nhere', "•Here's a bulleted line\n"]

Or, using the \u2022 escape instead of the literal bullet character:

new_list = list(filter(None, re.split('\n(?=\u2022)|\n(?=[A-Z])', s)))
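To illustrate why the lookahead matters, here is a minimal sketch comparing a pattern that consumes the character after the newline with one that only looks ahead at it. The sample string and the variable names are made up for the illustration, and the two alternatives are folded into a single character class, which is a variation on the pattern above rather than the answer's exact form.

```python
import re

# Hypothetical sample in the same shape as the question's text.
text = "first line \nstill first \nSecond item\n\u2022Bulleted item"

# Consuming pattern: the capital letter or bullet is part of the match,
# so re.split throws it away together with the newline.
consumed = re.split("\n[A-Z\u2022]", text)

# Lookahead pattern: only the newline is matched; the next character
# is checked but stays at the start of the following piece.
kept = re.split("\n(?=[A-Z\u2022])", text)

print(consumed)
print(kept)
```

With the consuming pattern, "Second item" loses its first letter and the bullet disappears; with the lookahead both survive.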

Answer 2

You can use this re.split call (Python 3 syntax; the variable is named text to avoid shadowing the built-in str):

>>> import re
>>> text = "Peter went to the gym; \nhe worked out for two hours \nKyle ate lunch at Kate's house. Kyle went home at 9. \nSome other sentence here\n\u2022Here's a bulleted line"
>>> print(re.split('\n(?=\u2022|[A-Z])', text))

['Peter went to the gym; \nhe worked out for two hours ',
 "Kyle ate lunch at Kate's house. Kyle went home at 9. ",
 'Some other sentence here',
 "•Here's a bulleted line"]
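The split alone still leaves the inner "\n" and trailing spaces that the desired output in the question does not have. A minimal sketch of one way to clean each piece afterwards, assuming that collapsing every run of whitespace to a single space is acceptable; split_items is a hypothetical helper name, not part of either answer:

```python
import re

text = ("Peter went to the gym; \nhe worked out for two hours \nKyle ate lunch "
        "at Kate's house. Kyle went home at 9. \nSome other sentence "
        "here\n\u2022Here's a bulleted line")


def split_items(raw):
    """Split at a newline followed by a bullet or capital letter, then tidy whitespace."""
    parts = re.split("\n(?=\u2022|[A-Z])", raw)
    # part.split() breaks on any whitespace (including "\n"); joining with a
    # single space collapses the runs and trims both ends. Empty pieces are skipped.
    return [" ".join(part.split()) for part in parts if part.strip()]


items = split_items(text)
for item in items:
    print(item)
```

This removes line breaks that fall in the middle of an item, so it only fits cases where those breaks carry no meaning.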


Answer 1
1 2018-02-18 15:23:12

Answer 2
1 Accepted 2018-02-18 16:16:06
