Python regex loop skipping every third item

I'm writing a tokenizer and I want to split strings like "words-bound-with-hyphen" into "words xxsep bound xxsep with xxsep hyphen".

I tried this:

import re

s = "words-bound-with-hyphen"
# Raw strings keep the backslashes intact; \w already covers digits, so \d is redundant.
reg_m = re.compile(r"[\w\d]+-[\w\d]+")
reg = re.compile(r"([\w\d]+)-([\w\d]+)")
while reg_m.match(s):
    s = reg.sub(r"\1 xxsep \2", s)
print(s)  # prints "words xxsep bound-with xxsep hyphen"

But this leaves some of the hyphens in place, such as the one in "bound-with" above.

You could just replace the hyphens with a regex:

In [4]: re.sub("-", " xxsep ", "word-bound-with-hyphen")
Out[4]: 'word xxsep bound xxsep with xxsep hyphen'

or with plain string replacement:

In [7]: "word-bound-with-hyphen".replace("-", " xxsep ")
Out[7]: 'word xxsep bound xxsep with xxsep hyphen'

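The reason your current approach doesn't work is that re.sub() replaces only non-overlapping matches, whereas words-bound overlaps with bound-with, which overlaps with with-hyphen. A single pass therefore handles only every other hyphen. On top of that, re.match() only matches at the start of the string, so after the first pass the loop condition is false and the remaining hyphens are never revisited.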
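To make the overlap problem concrete, here is a small sketch. The name `pair` is hypothetical and the pattern is simplified to `\w+`, which matches the same characters as `[\w\d]+`. It lists the matches that one pass actually sees, and shows that switching the loop condition from match() to search() would let the original loop finish.

```python
import re

s = "words-bound-with-hyphen"
pair = re.compile(r"(\w+)-(\w+)")  # hypothetical name; same idea as `reg` in the question

# finditer() yields the same non-overlapping matches that sub() replaces.
matches = [m.group(0) for m in pair.finditer(s)]
print(matches)

# One pass of sub() therefore leaves "bound-with" untouched.
once = pair.sub(r"\1 xxsep \2", s)
print(once)

# match() only tries position 0, so the original loop stops here.
print(pair.match(once))

# search() scans the whole string, so this loop runs until no pair is left.
fixed = s
while pair.search(fixed):
    fixed = pair.sub(r"\1 xxsep \2", fixed)
print(fixed)
```

This is only meant to explain the behaviour; a single re.sub() or str.replace() call is the simpler fix.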

If you don't want to replace every hyphen, but only those that are preceded and followed by certain characters, then use regex lookbehinds and lookaheads.

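import re

s = "words-bound-with-hyphen"
# Lookarounds check the neighbouring characters without consuming them,
# so adjacent matches no longer overlap.
re.sub(r"(?<=[\w\d])-(?=[\w\d])", " xxsep ", s)
# result: 'words xxsep bound xxsep with xxsep hyphen'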
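As a rough illustration of when the lookaround version matters, the sketch below compares it with a plain replace on a few made-up inputs. The helper name `split_inner_hyphens` and the sample strings are assumptions for this example, not part of the original answers.

```python
import re

def split_inner_hyphens(text):
    """Replace only hyphens that have a word character on both sides."""
    return re.sub(r"(?<=\w)-(?=\w)", " xxsep ", text)

samples = ["words-bound-with-hyphen", "-leading", "a - b", "trailing-"]
for sample in samples:
    # Left: lookaround version. Right: replace every hyphen unconditionally.
    print(repr(split_inner_hyphens(sample)), repr(sample.replace("-", " xxsep ")))
```

Hyphens at the edges of the string or surrounded by spaces are left alone by the lookaround version, while str.replace() rewrites all of them.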

import re

s = "words-bound-with-hyphen"
re.sub("-", " xxsep ", s)
# result: 'words xxsep bound xxsep with xxsep hyphen'

or, without using regular expressions:

" xxsep ".join(s.split("-"))

Here, the string is split into a list using - as the delimiter, and the pieces are then joined with " xxsep ".

Why not use word boundaries? Search for "\\b-\\b" (or the raw string r"\b-\b") and replace with " xxsep ".
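A minimal sketch of the word-boundary suggestion, assuming the replacement should carry the surrounding spaces as in the question; the function name `sep_at_word_boundaries` is made up for this example.

```python
import re

def sep_at_word_boundaries(text):
    """Replace hyphens that sit directly between two word characters."""
    # \b is zero-width, so neighbouring matches do not overlap.
    return re.sub(r"\b-\b", " xxsep ", text)

print(sep_at_word_boundaries("words-bound-with-hyphen"))
print(sep_at_word_boundaries("a - b"))  # no word character next to the hyphen
```

Because a hyphen is not a word character, `\b` on either side of it requires a word character there, which gives the same result as the lookaround pattern for inputs like these.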
