Adding word boundary to python regex capture group
I am trying to preprocess text before passing it to a StanfordCoreNLP server. Some of my text looks like this.
"Conversion of code written in C# to Visual Basic .NET (VB.NET)."
The ".NET" confuses the server because its leading dot is read as a period, which splits the single sentence into two. I want to replace a '.' that appears at the start of a word with 'DOT' so that the sentence stays in one piece. Note that I don't want to change anything in 'VB.NET', because StanfordCoreNLP recognizes that as one word (proper noun).
This is what I tried so far.
print(re.sub(r"\.(\S+)", r"DOT\g<0>", text))
The result looks like this.
Conversion of code written in C# to Visual Basic DOT.NET (VBDOT.NET).
I tried adding word boundaries to the pattern, r"\b\.(\S+)\b", but it didn't work.
Any help would be appreciated.
You can use
re.sub(r"\B\.\b", "DOT", text)
See the regex demo.
The \B\.\b regex matches a dot that is either at the start of the string or immediately preceded by a non-word char, and that is immediately followed by a word char.
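To make the rule concrete, here is a small sketch that runs the pattern over a few inputs. The sample strings are made up for illustration (they are not from the question) and are chosen to cover a dot at the start of the string, after a space, between two word chars, and at the end of a sentence.

```python
import re

# The pattern from the answer: a dot with no word char before it
# and a word char right after it.
LEADING_DOT = re.compile(r"\B\.\b")

# Hypothetical sample inputs, one per case.
samples = [
    ".NET is a framework",  # dot at the start of the string
    "Visual Basic .NET",    # dot preceded by a space
    "(VB.NET)",             # dot between two word chars
    "End of sentence.",     # dot preceded by a word char, nothing after it
]

results = [LEADING_DOT.sub("DOT", s) for s in samples]
for before, after in zip(samples, results):
    print(f"{before!r} -> {after!r}")
```

Only the first two samples change. In "(VB.NET)" and "End of sentence." the dot has a word char on its left, so \B fails there and the text is left alone.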
See the Python demo:
import re
text = "Conversion of code written in C# to Visual Basic .NET (VB.NET)."
print( re.sub(r"\B\.\b", "DOT", text) )
# => Conversion of code written in C# to Visual Basic DOTNET (VB.NET).
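As a cross-check, the same rule can be spelled out with lookarounds, which some readers may find easier to read. This is a sketch and not part of the original answer; the lookaround pattern is my assumed equivalent of \B\.\b. It also runs the pattern from the question's second attempt: there, \b before the dot demands a word char on its left, so the attempt skips " .NET" and rewrites the dot inside "VB.NET" instead.

```python
import re

text = "Conversion of code written in C# to Visual Basic .NET (VB.NET)."

# Pattern from the answer.
with_boundaries = re.sub(r"\B\.\b", "DOT", text)

# Assumed equivalent: no word char before the dot, a word char right after it.
with_lookarounds = re.sub(r"(?<!\w)\.(?=\w)", "DOT", text)

# Second attempt from the question: \b before the dot needs a word char
# on its left, so it only fires on the dot inside "VB.NET".
question_attempt = re.sub(r"\b\.(\S+)\b", r"DOT\g<0>", text)

print(with_boundaries)
print(with_boundaries == with_lookarounds)
print(question_attempt)
```

On this input both variants produce the same string, while the question's attempt leaves " .NET" untouched and turns "(VB.NET)" into "(VBDOT.NET)".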