
Python Regex Matching - Splitting on punctuation but ignoring certain words

Suppose I have the following sentence:

Hi, my name is Dr. Who. I'm in love with fish-fingers and custard !!

I'm trying to capture the punctuation (except the apostrophe and hyphen) using regular expressions, but I also want to ignore certain words. For example, I'm ignoring Dr., and so I don't want to capture the . in the word Dr.

Ideally, the regex should capture the text in between the parentheses:

Hi(, )my( )name( )is( )Dr.( )Who(. )I'm( )in( )love( )with( )fish-fingers( )and( )custard( !!)

Note that I have a Python list that contains words like "Dr." that I want to ignore. I'm also using string.punctuation to get a list of punctuation characters to use in the regex. I've tried using negative lookahead but it was still catching the "." in Dr. Any help appreciated!

You can first throw away all your stop words (like "Dr.") and then all letters (and digits).

import re

text = "Hi, my name is Dr. Who. I'm in love with fish-fingers and custard !!"
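# Remove the stop words first (a non-capturing group with an escaped dot,
# not a character class, so whole words are matched).
tmp = re.sub(r'\b(?:Dr|Prof)\.', '', text)
# Then remove every run of letters and digits.
print(re.sub(r'[a-zA-Z0-9]+', '', tmp))

Would that work?

It would print:

,     . '    -   !!

The output is the text between the parentheses in your question, run together into one string. The apostrophe and hyphen are still in there, so you would have to exclude them separately if you don't want them.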
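If you need the separators as a list, with the ignored words coming from a Python list and the punctuation from string.punctuation as the question describes, here is one possible sketch. It uses a different technique from the substitution above: an alternation that consumes an ignored word first, and otherwise captures a run of punctuation and whitespace. The function name find_separators and the keep_chars parameter are made up for illustration; they are not part of the question or of any library.

```python
import re
import string


def find_separators(text, ignored_words, keep_chars="'-"):
    """Return runs of punctuation/whitespace, skipping the ignored words.

    find_separators and keep_chars are hypothetical names for this sketch.
    """
    # Punctuation to capture: string.punctuation minus apostrophe and hyphen.
    punct = "".join(c for c in string.punctuation if c not in keep_chars)
    separator = "[" + re.escape(punct) + r"\s]+"

    alternatives = []
    if ignored_words:
        # Longest words first, so a longer entry wins over a shorter prefix.
        words = sorted(ignored_words, key=len, reverse=True)
        escaped = "|".join(re.escape(w) for w in words)
        # The lookbehind keeps "Dr." from matching in the middle of a word.
        alternatives.append(rf"(?<!\w)(?:{escaped})")
    # Group 1 is only set when a separator run matched.
    alternatives.append(rf"({separator})")
    pattern = re.compile("|".join(alternatives))

    return [m.group(1) for m in pattern.finditer(text) if m.group(1) is not None]


text = "Hi, my name is Dr. Who. I'm in love with fish-fingers and custard !!"
print(find_separators(text, ["Dr.", "Prof."]))
# [', ', ' ', ' ', ' ', ' ', '. ', ' ', ' ', ' ', ' ', ' ', ' ', ' !!']
```

Because the ignored words are tried before the separator pattern at each position, the "." in "Dr." is consumed as part of the word and never reaches the capturing group, which is what a plain negative lookahead tends to miss.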
