Adding a word boundary to a Python regex capture group

I am trying to preprocess text before sending it to a StanfordCoreNLP server. Some of my text looks like this:

"Conversion of code written in C# to Visual Basic .NET (VB.NET)."

The ".NET" confuses the server because its leading dot is read as a period, which splits the single sentence into two. I want to replace a '.' that appears at the start of a word with 'DOT' so that the sentence stays in one piece. Note that I don't want to change anything in 'VB.NET', because StanfordCoreNLP recognizes that as one word (a proper noun).

This is what I have tried so far:

print(re.sub(r"\.(\S+)", r"DOT\g<0>", text))

The result looks like this:

Conversion of code written in C# to Visual Basic DOT.NET (VBDOT.NET).

I tried adding word boundaries to the pattern, r"\b\.(\S+)\b", but it didn't work.

Any help would be appreciated.

You can use

re.sub(r"\B\.\b", "DOT", text)

See the regex demo.

The \B\.\b regex matches a dot that is either at the start of the string or immediately preceded by a non-word character, and that is immediately followed by a word character.
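The two boundary conditions can be checked on a few small strings. This is only a minimal sketch: the helper name `replace_leading_dot` and the sample strings are made up for illustration and are not part of the original answer.

```python
import re

# Hypothetical helper wrapping the pattern from the answer.
def replace_leading_dot(text):
    # \B before the dot: the dot is at the start of the string
    # or follows a non-word character.
    # \b after the dot: the next character is a word character
    # (letter, digit, or underscore).
    return re.sub(r"\B\.\b", "DOT", text)

print(replace_leading_dot(".NET"))        # dot at start of string: DOTNET
print(replace_leading_dot("Basic .NET"))  # dot after a space: Basic DOTNET
print(replace_leading_dot("VB.NET"))      # dot between word chars: VB.NET
print(replace_leading_dot("The end."))    # dot with nothing after it: The end.
```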

See the Python demo:

import re
text = "Conversion of code written in C# to Visual Basic .NET (VB.NET)."
print(re.sub(r"\B\.\b", "DOT", text))
# => Conversion of code written in C# to Visual Basic DOTNET (VB.NET).
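
For comparison, here is a short sketch of why the r"\b\.(\S+)\b" attempt from the question probably did not help. Because "." is a non-word character, a \b in front of it requires a word character on its left, so that pattern picks the dot inside "VB.NET" and skips the one in " .NET", the opposite of what was wanted. The variable names `attempt` and `fixed` are illustrative only.

```python
import re

text = "Conversion of code written in C# to Visual Basic .NET (VB.NET)."

# Pattern from the question: \b before the dot needs a word char on its left.
attempt = re.sub(r"\b\.(\S+)\b", r"DOT\g<0>", text)
print(attempt)
# => Conversion of code written in C# to Visual Basic .NET (VBDOT.NET).

# Pattern from the answer: \B before the dot forbids a word char on its left.
fixed = re.sub(r"\B\.\b", "DOT", text)
print(fixed)
# => Conversion of code written in C# to Visual Basic DOTNET (VB.NET).
```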
