Python: find all occurrences of hyphenated words and replace them in place

Question

I need to replace every occurrence of a hyphenated pattern such as cccc-come or oh-oh-oh-oh with its last token (come or oh in these examples), where:

The number of characters between hyphens is arbitrary; it can be one or more characters.
The token to keep is the last token in the hyphenation, hence come in cc-come.
The input string may contain one or more occurrences, as in the following sentences:
cccc-come to home today cccc-come to me

oh-oh-oh-oh it's a bad life oh-oh-oh-oh

I also need the start and end position of each matched token, obtained via finditer:

import re

# `pattern` is the regex being asked for; `text` is the input string
r = re.compile(pattern, flags=re.I | re.X | re.UNICODE)
for m in r.finditer(text):
    word = m.group()
    characterOffsetBegin = m.start()
    characterOffsetEnd = m.end()
    # now replace and store indexes

[UPDATE]

Since these hyphenated words do not belong to a fixed dictionary, I'm adding this constraint:

The number of characters between hyphens must fall within a minimum and a maximum, such as {1,3}, so that the capture group matches c-come or cc-come but not a real hyphenated word such as fine-tuning or inter-face.

Answer 1

You can just use re.sub() to replace all occurrences without having to iterate over the matched indices:

import re

s = 'c-c-c-c-come to home today c-c-c-c-come to me'

print(re.sub(r'(\w+(?:-))+(\w+)', '\\2', s))
# come to home today come to me
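
If the offsets are still needed, as the question asks, the same pattern can be driven through finditer instead. The following is only a sketch and not part of the original answer: the helper name replace_stutters, the constant STUTTER, and the choice to report offsets relative to the original (unmodified) string are all assumptions.

```python
import re

# Same idea as the pattern above; the inner group is made non-capturing,
# so the last token is group 1 here (an assumption of this sketch).
STUTTER = re.compile(r'(?:\w+-)+(\w+)')

def replace_stutters(text):
    """Return (new_text, spans); spans are (token, start, end) in the ORIGINAL text."""
    spans = []
    pieces = []
    last = 0
    for m in STUTTER.finditer(text):
        spans.append((m.group(1), m.start(1), m.end(1)))
        pieces.append(text[last:m.start()])  # untouched text before the match
        pieces.append(m.group(1))            # keep only the last token
        last = m.end()
    pieces.append(text[last:])               # tail after the final match
    return ''.join(pieces), spans

new_text, spans = replace_stutters('c-c-c-c-come to home today c-c-c-c-come to me')
print(new_text)  # come to home today come to me
print(spans)     # [('come', 8, 12), ('come', 35, 39)]
```

Like the re.sub() call above, this places no limit on the prefix length, so a real hyphenated word such as fine-tuning would also be shortened.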

Answer 2

Here is one possible expression:

import re

text = ("c-c-c-c-come to home today c-c-c-c-come to me, "
        "oh-oh-oh-oh it's a bad life oh-oh-oh-oh")
pattern = r"(?<=-)\w+(?=[^-\w])"
r = re.compile(pattern, flags=re.I | re.X | re.UNICODE)
for m in r.finditer(text):
    word = m.group()
    characterOffsetBegin = m.start()
    print(word, characterOffsetBegin)

Output:

come 8
come 35
oh 56
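
Note that the lookahead (?=[^-\w]) needs one more character after the token, so the final oh-oh-oh-oh at the very end of the text is not reported above. One possible variant, offered as a sketch rather than the answerer's own code, swaps it for the negative lookahead (?![-\w]), which also succeeds at the end of the string. The variable name matches is an assumption.

```python
import re

text = ("c-c-c-c-come to home today c-c-c-c-come to me, "
        "oh-oh-oh-oh it's a bad life oh-oh-oh-oh")

# (?![-\w]) holds when the next char is neither a hyphen nor a word char,
# and it also holds at the end of the string, unlike (?=[^-\w]).
pattern = re.compile(r"(?<=-)\w+(?![-\w])")

matches = [(m.group(), m.start(), m.end()) for m in pattern.finditer(text)]
print(matches)
# [('come', 8, 12), ('come', 35, 39), ('oh', 56, 58), ('oh', 84, 86)]
```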

Answer 3

An option using a capturing group and a backreference might be:

(?<!\S)(\w{1,3})(?:-\1)*-(\w+)(?!\S)

That will match:

(?<!\S) Negative lookbehind: assert that what is on the left is not a non-whitespace char
(\w{1,3}) Capture 1 to 3 word chars in group 1
(?:-\1)* Repeat 0+ times: a hyphen followed by a backreference to what was matched in group 1
-(\w+) Match - followed by 1+ word chars, captured in group 2
(?!\S) Negative lookahead: assert that what is on the right is not a non-whitespace char

In the replacement, use the second capturing group: '\\2' or r'\2'.


For example:

import re

text = "c-c-c-c-come oh-oh-oh-oh it's a bad life oh-oh-oh-oh"
pattern = r"(?<!\S)(\w{1,3})(?:-\1)*-(\w+)(?!\S)"
text = re.sub(pattern, r'\2', text)
print(text)

Result:

come oh it's a bad life oh
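
To see how the {1,3} bound interacts with the constraint from the question's update, here is a small sketch that also records the offsets of the kept token. It is an illustration under assumptions: the sample sentence, the names hits and cleaned, and the use of re.I (so that Oh-oh-oh is treated as a repetition) are not from the original answer.

```python
import re

# Prefix limited to 1-3 word characters, as in the example above.
# re.I makes the backreference \1 case-insensitive (assumption of this sketch).
pattern = re.compile(r"(?<!\S)(\w{1,3})(?:-\1)*-(\w+)(?!\S)", flags=re.I)

text = "c-come and cc-come but fine-tuning the inter-face Oh-oh-oh no"

# Offsets of the kept token (group 2) in the original text.
hits = [(m.group(2), m.start(2), m.end(2)) for m in pattern.finditer(text)]
cleaned = pattern.sub(r"\2", text)

print(hits)     # [('come', 2, 6), ('come', 14, 18), ('oh', 56, 58)]
print(cleaned)  # come and come but fine-tuning the inter-face oh no
```

The length bound is only a heuristic: fine-tuning and inter-face are left alone because their first part is longer than three characters, but a real word with a short prefix, such as re-do, would still be collapsed.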

Answer 4

It can be done without regular expressions. Code:

s = "c-c-c-c-come to home today c-c-c-c-come to me"
s = " ".join(w if "-" not in w else w[w.rindex('-') + 1:] for w in s.split(" "))

Output:

come to home today come to me
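
The same split-based idea can be extended to report the positions the question asks for. This is a sketch under assumptions, not the answerer's code: the function name strip_stutter is made up, words are taken to be separated by single spaces, and offsets refer to the original string.

```python
def strip_stutter(text):
    """Keep the part after the last hyphen of each space-separated word.

    Returns (new_text, spans), where spans are (token, start, end)
    offsets into the ORIGINAL text.
    """
    words, spans, pos = [], [], 0
    for word in text.split(" "):
        cut = word.rfind("-") + 1  # 0 when the word contains no hyphen
        if cut:
            spans.append((word[cut:], pos + cut, pos + len(word)))
        words.append(word[cut:])
        pos += len(word) + 1       # +1 for the separating space
    return " ".join(words), spans

result, spans = strip_stutter("c-c-c-c-come to home today c-c-c-c-come to me")
print(result)  # come to home today come to me
print(spans)   # [('come', 8, 12), ('come', 35, 39)]
```

As with the one-liner above, every hyphenated word is shortened, so this variant does not honour the {1,3} constraint from the question's update, and punctuation attached to a word stays attached to the kept token.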
