regex to remove words from a list that are not A-Z a-z (exceptions)
I would like to remove non-alpha characters from a string and turn each word into a list item, such that:
"All, the above." -> ["all", "the", "above"]
It would seem that the following function works:
re.split(r'\W+', s)  # raw string for the pattern; avoid shadowing the built-in str
but it does not account for corner cases.
For example:
"The U.S. is where it's nice." -> ["the", "U", "S", "is", "where", "it", "s", "nice"]
I want the sentence-ending period removed, but not the apostrophe or the periods in "U.S."
My idea is to create a regex that splits on spaces and then removes the extra punctuation:
"I, live at home." -> ["I", "live", "at", "home"] (comma and period removed)
"I J.C. live at home." -> ["I", "J.C.", "live", "at", "home"] (acronym periods not removed but end of sentence period removed)
What I'm trying to do gets quite difficult for sentences like:
"The flying saucer (which was green)." -> ["...", "green"] (ignore ").")
"I J.C., live at home." -> ["I", "J.C.", "..."] (ignore punctuation)
Special case (strings are retrieved from a raw text file):
"I love you.<br /> Come home soon!" -> ["..."] (ignore breakpoint and punctuation)
I am relatively new to Python and creating regexes is confusing to me, so any help on how to parse strings in this way would be very helpful! If there is a catch-22 here and not everything I am trying to accomplish is possible, let me know.
Although I understand you are asking specifically about regex, another solution to your overall problem is to use a library built for this express purpose, for instance
nltk
. It should help you split your strings in sane ways (parsing the proper punctuation out into separate items in a list), which you can then filter out from there.
You are right, the number of corner cases is huge, precisely because human language is imprecise and vague. Using a library that already accounts for these edge cases should save you a lot of headache.
A helpful primer on dealing with raw text in nltk is here . The most useful function for your use case seems to be
nltk.word_tokenize
, which returns a list of strings with words and punctuation separated.
Here's a Python regex that should work for splitting the sentences you provided.
((?<![A-Z])\.)*[\W](?<!\.)|[\W]$
Since all abbreviations with periods should have a capital letter before the period, we can use a negative lookbehind to exclude those periods:
((?<![A-Z])\.)*
Then it splits on any other non-word character that is not a period:
[\W](?<!\.)
or on symbols at the end of the line:
|[\W]$
I tested the regex on these strings:
The RN lives in the US
The RN, lives in the US here.
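Putting the pieces together (a sketch: `re.split` also returns the capture group, usually None here, plus empty strings between adjacent separators, so falsy values are filtered out):

```python
import re

PATTERN = r"((?<![A-Z])\.)*[\W](?<!\.)|[\W]$"

def words(sentence):
    # Drop the None group captures and empty strings that
    # re.split emits alongside the real words.
    return [w for w in re.split(PATTERN, sentence) if w]

print(words("The RN lives in the US"))
# ['The', 'RN', 'lives', 'in', 'the', 'US']
print(words("The RN, lives in the US here."))
# ['The', 'RN', 'lives', 'in', 'the', 'US', 'here']
```

Note that apostrophes are still split on ("it's" becomes "it", "s"), so this pattern handles the abbreviation periods but not every case from the question.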