Python regex loop skipping every third item
I'm writing a tokenizer and I want to split strings like "word-bound-with-hyphen" into "word xxsep bound xxsep with xxsep hyphen".
I tried this:
import re
s = "words-bound-with-hyphen"
reg_m = re.compile(r"[\w\d]+-[\w\d]+")
reg = re.compile(r"([\w\d]+)-([\w\d]+)")
while(reg_m.match(s)):
    s = reg.sub(r"\1 xxsep \2", s)
print(s) #prints "words xxsep bound-with xxsep hyphen"
But this leaves some hyphens in place (here, the one between "bound" and "with").
You could just replace the hyphens with a regex:
In [4]: re.sub("-", " xxsep ", "word-bound-with-hyphen")
Out[4]: 'word xxsep bound xxsep with xxsep hyphen'
or with string substitution:
In [7]: "word-bound-with-hyphen".replace("-", " xxsep ")
Out[7]: 'word xxsep bound xxsep with xxsep hyphen'
The reason your current approach doesn't work is that re.sub() replaces only non-overlapping matches, whereas word-bound overlaps with bound-with, which in turn overlaps with with-hyphen. Once "bound" has been consumed as the right-hand side of the first match, it cannot also start a second match.
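To make the overlap concrete, here is a minimal sketch (the variable names are only illustrative) that lists what the pattern actually matches, and then repeats the substitution until nothing changes. Note also that re.match() only matches at the start of the string, which is why the original while loop stops after one pass.

```python
import re

s = "words-bound-with-hyphen"
pattern = re.compile(r"(\w+)-(\w+)")

# Matches are found left to right and never overlap: "bound" is consumed
# by the first match, so "bound-with" is never a candidate.
matches = pattern.findall(s)
print(matches)

# A single sub() call therefore rewrites only those matches.
once = pattern.sub(r"\1 xxsep \2", s)
print(once)

# One possible repair of the loop idea (an assumption, not the asker's code):
# subn() also returns the number of replacements, so repeat until it is 0.
fixed = s
count = 1
while count:
    fixed, count = pattern.subn(r"\1 xxsep \2", fixed)
print(fixed)
```

This works, but it rescans the string several times; the one-pass replacements shown in the answers are simpler.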
If you don't want to replace all hyphens but only those that are preceded and followed by certain characters, then use regex lookbehinds and lookaheads.
import re
s = "words-bound-with-hyphen"
re.sub(r'(?<=[\w\d])-(?=[\w\d])', ' xxsep ', s)
# result: 'words xxsep bound xxsep with xxsep hyphen'
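As a rough illustration of why the lookarounds matter, the sketch below compares them with a plain replacement on a made-up input that also contains free-standing hyphens. The sample string is an assumption chosen for the demo, and \w is used alone because it already covers digits.

```python
import re

text = "well-known - or not-"

# Lookbehind and lookahead assert a word character on each side
# without consuming it, so adjacent matches cannot overlap.
guarded = re.sub(r"(?<=\w)-(?=\w)", " xxsep ", text)
print(guarded)

# A plain replacement touches every hyphen, including the isolated ones.
plain = text.replace("-", " xxsep ")
print(plain)
```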
import re
s = "words-bound-with-hyphen"
re.sub('-',' xxsep ',s)
or without using regular expressions:
" xxsep ".join(s.split('-'))
Here, the string is split into a list using - as the delimiter, and the pieces are then joined with " xxsep ".
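The two steps can be seen separately in this small sketch (the intermediate names are only for illustration):

```python
s = "words-bound-with-hyphen"

# Step 1: split on the hyphen to get the individual pieces.
parts = s.split("-")
print(parts)

# Step 2: join the pieces back together with the separator token.
joined = " xxsep ".join(parts)
print(joined)
```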
Why not use word boundaries? Search for \\b-\\b (in Python, the raw string r"\b-\b") and replace with " xxsep ".
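In Python this suggestion might look like the sketch below. Because - is not a word character, \b on each side only matches when the neighbouring character is a word character, so hyphens surrounded by spaces are left alone. The second input is a made-up example.

```python
import re

s = "words-bound-with-hyphen"

# \b is zero-width, so consecutive hyphenated words do not overlap.
result = re.sub(r"\b-\b", " xxsep ", s)
print(result)

# A hyphen with spaces around it is not between two word characters.
spaced = re.sub(r"\b-\b", " xxsep ", "a - b")
print(spaced)
```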