Adding word boundaries to a Python regex capture group

Question

I am trying to preprocess text before sending it to a StanfordCoreNLP server. Some of my text looks like this:

"Conversion of code written in C# to Visual Basic .NET (VB.NET)."

The ".NET" confuses the server because its leading dot is read as a period, which splits the single sentence into two. I want to replace a '.' that appears at the start of a word with 'DOT' so that the sentence stays in one piece. Note that I don't want to change anything in 'VB.NET', because StanfordCoreNLP recognizes that as one word (a proper noun).

This is what I have tried so far:

print(re.sub(r"\.(\S+)", r"DOT\g<0>", text))

The result looks like this:

Conversion of code written in C# to Visual Basic DOT.NET (VBDOT.NET).

I also tried adding word boundaries to the pattern, r"\b\.(\S+)\b", but it didn't work.

Any help would be appreciated.

Answer 1

You can use

re.sub(r"\B\.\b", "DOT", text)

See the regex demo.

The \B\.\b regex matches a dot that is either at the start of the string or immediately preceded by a non-word character, and that is immediately followed by a word character.
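To make that rule concrete, here is a minimal sketch that lists which dots the pattern selects in the sample sentence from the question. The variable names are illustrative and not part of the original answer.

```python
import re

text = "Conversion of code written in C# to Visual Basic .NET (VB.NET)."

# \B before the dot: the previous character is a non-word char (or the string start).
# \b after the dot: the next character is a word char.
pattern = re.compile(r"\B\.\b")

# For each matched dot, record its position and the four characters starting there.
matches = [(m.start(), text[m.start():m.start() + 4]) for m in pattern.finditer(text)]
print(matches)
```

Only the dot in " .NET" qualifies. The dot in "VB.NET" follows a word character, and the final period is not followed by one.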

See the Python demo:

import re
text = "Conversion of code written in C# to Visual Basic .NET (VB.NET)."
print(re.sub(r"\B\.\b", "DOT", text))
# => Conversion of code written in C# to Visual Basic DOTNET (VB.NET).
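For contrast, the following sketch shows why the `\b\.(\S+)\b` attempt from the question misses its target. Because the dot itself is a non-word character, `\b` in front of it requires a word character on its left, so it selects the dot inside "VB.NET" rather than the one in " .NET". The lookaround version is an equivalent, more explicit spelling of `\B\.\b`, added here for illustration only. It is not part of the original answer.

```python
import re

text = "Conversion of code written in C# to Visual Basic .NET (VB.NET)."

# Attempt from the question: \b before "." needs a word char on its left,
# so only the dot inside "VB.NET" qualifies.
attempt = re.sub(r"\b\.(\S+)\b", r"DOT\g<0>", text)

# Explicit lookaround spelling of \B\.\b:
# no word char before the dot, a word char right after it.
lookaround = re.sub(r"(?<!\w)\.(?=\w)", "DOT", text)

print(attempt)
print(lookaround)
```

The first call leaves " .NET" untouched and rewrites "VB.NET" instead. The second call produces the same output as the accepted pattern.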

Answer 1 (accepted 2020-12-09 22:40:36)
