Python find all occurrences of hyphenated word and replace at position

I have to replace all occurrences of hyphenated patterns such as cccc-come or oh-oh-oh-oh with their last token (come or oh in these examples), where

  • The number of characters between hyphens is arbitrary; it can be one or more characters.
  • The token to match is the last token in the hyphenation, hence come in cc-come.
  • The input string may contain one or more occurrences, as in the following sentences:

    cccc-come to home today cccc-come to me

    oh-oh-oh-oh it's a bad life oh-oh-oh-oh

  • I need to find the start and end position of the matched token via finditer:

    r = re.compile(pattern, flags=re.I | re.X | re.UNICODE)
    for m in r.finditer(text):
        word = m.group()
        characterOffsetBegin = m.start()
        characterOffsetEnd = m.end()
        # now replace and store indexes

[UPDATE]

Assuming that those hyphenated words do not belong to a fixed dictionary, I'm adding this constraint:

  • The number of characters between hyphens must range from a minimum to a maximum, like {1,3}, so that the capture group matches c-come or cc-come, but not a real hyphenated word like fine-tuning or inter-face.

You can just use re.sub() to replace all occurrences without having to iterate over matched indices:

import re

s = 'c-c-c-c-come to home today c-c-c-c-come to me'

print(re.sub(r'(\w+(?:-))+(\w+)', '\\2', s))
# come to home today come to me
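
If the offsets asked for in the question are needed as well, re.sub() also accepts a function as the replacement. The following is a minimal sketch, not part of the answer above: the names replace_and_record and keep_last_token are hypothetical, and the prefix group is made non-capturing so that the last token becomes group 1. The recorded offsets refer to the original string.

```python
import re

# Same idea as the pattern above; the prefix is non-capturing,
# so the last token is group 1 (an adaptation for this sketch).
PATTERN = re.compile(r'(?:\w+-)+(\w+)')

def replace_and_record(text):
    """Return the rewritten text and (token, start, end) offsets in the ORIGINAL text."""
    found = []

    def keep_last_token(m):
        # m.start(1) / m.end(1) delimit the last token inside the input string
        found.append((m.group(1), m.start(1), m.end(1)))
        return m.group(1)

    return PATTERN.sub(keep_last_token, text), found

new_text, found = replace_and_record('c-c-c-c-come to home today c-c-c-c-come to me')
print(new_text)   # come to home today come to me
print(found)      # [('come', 8, 12), ('come', 35, 39)]
```

Like the one-liner above, this pattern has no length limit on the parts before the last hyphen, so a real word such as fine-tuning would also be collapsed.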

Here is one possible expression:

import re

text = ("c-c-c-c-come to home today c-c-c-c-come to me, "
        "oh-oh-oh-oh it's a bad life oh-oh-oh-oh")
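# Last token: word characters preceded by a hyphen and not followed by
# another hyphen or word character (this also matches at the end of the string).
pattern = r"(?<=-)\w+(?![-\w])"
r = re.compile(pattern, flags=re.I | re.X | re.UNICODE)
for m in r.finditer(text):
    word = m.group()
    characterOffsetBegin = m.start()
    print(word, characterOffsetBegin)

Output:

come 8
come 35
oh 56
oh 84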

An option using a capturing group and a backreference might be:

(?<!\S)(\w{1,3})(?:-\1)*-(\w+)(?!\S)

That will match:

  • (?<!\S) Negative lookbehind, assert that what is on the left is not a non-whitespace char
  • (\w{1,3}) Capture in group 1 one to three word chars
  • (?:-\1)* Repeat 0+ times a hyphen followed by a backreference to what is matched in group 1
  • -(\w+) Match - followed by 1+ word chars, captured in group 2
  • (?!\S) Negative lookahead, assert that what is on the right is not a non-whitespace char

In the replacement use the second capturing group, written as '\\2' or r'\2'.
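
To see what the length bound does, here is a small check, offered as a sketch rather than as part of the answer. It uses the {1,3} bound from the question's update and a handful of test words taken from the question; the results dictionary is only there for illustration.

```python
import re

pattern = re.compile(r"(?<!\S)(\w{1,3})(?:-\1)*-(\w+)(?!\S)")

words = ["c-come", "cc-come", "oh-oh-oh-oh", "fine-tuning", "inter-face"]
results = {}
for word in words:
    m = pattern.fullmatch(word)
    # group 2 is the last token; None means the word is left alone
    results[word] = m.group(2) if m else None
    print(word, "->", results[word])
```

The stuttered forms yield their last token, while fine-tuning and inter-face are not matched because their first part is longer than three characters. A real word with a short first part, such as re-do, would still match, so the bound reduces false positives without ruling them out.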


For example:

import re

text = "c-c-c-c-come oh-oh-oh-oh it's a bad life oh-oh-oh-oh"
pattern = r"(?<!\S)(\w{1,3})(?:-\1)*-(\w+)(?!\S)"
text = re.sub(pattern, r'\2', text)
print(text)

Result:

come oh it's a bad life oh
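
The question also asks for start and end positions. One way to get them with the same pattern is to walk the matches with finditer and rebuild the string piece by piece, which gives the offsets of each kept token in the rewritten text. This is a sketch under that reading of the requirement; the variable names (pieces, records, out_len) are illustrative only.

```python
import re

pattern = re.compile(r"(?<!\S)(\w{1,3})(?:-\1)*-(\w+)(?!\S)")
text = "c-c-c-c-come oh-oh-oh-oh it's a bad life oh-oh-oh-oh"

pieces = []    # fragments of the rewritten text
records = []   # (token, start, end) offsets in the REWRITTEN text
last_end = 0   # end of the previous match in the original text
out_len = 0    # length of the rewritten text built so far

for m in pattern.finditer(text):
    gap = text[last_end:m.start()]   # untouched text between two matches
    token = m.group(2)               # last token of the hyphenated run
    pieces += [gap, token]
    start = out_len + len(gap)
    records.append((token, start, start + len(token)))
    out_len = start + len(token)
    last_end = m.end()

pieces.append(text[last_end:])       # tail after the final match
new_text = "".join(pieces)

print(new_text)   # come oh it's a bad life oh
print(records)    # [('come', 0, 4), ('oh', 5, 7), ('oh', 24, 26)]
```

If the offsets in the original string are wanted instead, m.start(2) and m.end(2) give them directly inside the loop.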

It can be done without regular expressions. Code:

s = "c-c-c-c-come to home today c-c-c-c-come to me"
s = " ".join(w if "-" not in w else w[w.rindex('-') + 1:] for w in s.split(" "))

Output:

come to home today come to me
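
The one-liner above collapses every word that contains a hyphen. If the length constraint from the question's update should apply here too, a variant could check the parts before the last hyphen. This is only a sketch: the helper last_token and its max_len parameter are hypothetical names, and the sample sentence is made up from words in the question.

```python
def last_token(word, max_len=3):
    """Return the last hyphen-separated part if all earlier parts are short."""
    parts = word.split("-")
    # Collapse only when every part before the last has 1..max_len characters
    if len(parts) > 1 and all(1 <= len(p) <= max_len for p in parts[:-1]):
        return parts[-1]
    return word

s = "c-c-c-c-come to a fine-tuning job oh-oh-oh-oh"
result = " ".join(last_token(w) for w in s.split(" "))
print(result)   # come to a fine-tuning job oh
```

Unlike the backreference pattern, this check does not require the repeated parts to be identical, so it is somewhat more permissive.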
