How to substitute intra-word hyphens when hyphens surround a single inner character

I want to preserve intra-word hyphens in text before tokenizing it. The strategy is to replace those hyphens with a unique placeholder character, then turn the placeholder back into hyphens after tokenizing. Note: I'll ultimately use the Unicode class Pd to catch all forms of dash character, but I'm keeping it simple here since I don't think that part is pertinent to the problem.

Problem: My pattern fails when a word contains two inner hyphens with only a single character between them.

Examples and desired outcomes:

replaceDash <- function(x) gsub("(\\w)-(\\w)", "\\1§\\2", x)

# these are all OK
replaceDash("Hawaii-Five-O")  
## [1] "Hawaii§Five§O"
replaceDash("jack-of-all-trades")  
## [1] "jack§of§all§trades"
replaceDash("A-bomb")         
## [1] "A§bomb"
replaceDash("freakin-A")      
## [1] "freakin§A"

# not the desired outcome
replaceDash("jack-o-lantern")  # FAILS - should be "jack§o§lantern"
## [1] "jack§o-lantern"
replaceDash("Whack-a-Mole")    # FAILS - should be "Whack§a§Mole"
## [1] "Whack§a-Mole"

What regex patterns do I need for the pattern and replacement arguments of gsub()?

You can use a PCRE regex with a look-ahead that checks whether a word character appears right after the hyphen, without consuming that character.

replaceDash <- function(x) gsub("(\\w)-(?=\\w)", "\\1§", x, perl = TRUE)

See IDEONE demo

So, (\\w) captures a word character (a letter, digit or underscore) into Group 1, which the \\1 backreference re-inserts into the replacement. The look-ahead (?=\\w) only asserts that a word character follows the hyphen and does not consume it, so the regex position stays just after the hyphen and the next match can start at that word character.
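To make the difference concrete, here is a small sketch comparing where each pattern matches. The helper names consuming and lookahead are just labels for this illustration, not part of the question.

```r
# Hypothetical helper names, used only for this comparison
consuming <- function(x) gsub("(\\w)-(\\w)", "\\1§\\2", x)
lookahead <- function(x) gsub("(\\w)-(?=\\w)", "\\1§", x, perl = TRUE)

x <- "jack-o-lantern"

# Start positions of the matches found by each pattern
pos_consuming <- as.integer(gregexpr("(\\w)-(\\w)", x)[[1]])
pos_lookahead <- as.integer(gregexpr("(\\w)-(?=\\w)", x, perl = TRUE)[[1]])

pos_consuming   # 4: "k-o" is matched, which uses up the "o"
pos_lookahead   # 4 6: "k-" and then "o-", because "o" was never consumed

consuming(x)
## [1] "jack§o-lantern"
lookahead(x)
## [1] "jack§o§lantern"
```

The original pattern consumes the "o" as part of the first match, so nothing is left to serve as the word character before the second hyphen.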

You did not specify what kind of regex capabilities are allowed. Here is a pattern using zero-width lookaround:

# perl = TRUE is required: R's default regex engine does not support lookbehind
gsub("(?<=\\w)-(?=\\w)", "§", "jack-o-lantern", perl = TRUE)
# jack§o§lantern
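For completeness, here is a rough sketch of the whole round trip described in the question, using the lookaround pattern. The names protect, restore and tokenize are made up for this example, and tokenize is only a stand-in (it splits on spaces and hyphens), not whatever tokenizer you actually use.

```r
# Hypothetical helpers for illustration only
protect <- function(x) gsub("(?<=\\w)-(?=\\w)", "§", x, perl = TRUE)
restore <- function(x) gsub("§", "-", x, fixed = TRUE)

# Stand-in tokenizer (an assumption): splits on runs of spaces and hyphens
tokenize <- function(x) strsplit(x, "[ -]+")[[1]]

txt <- "jack-o-lantern - a Whack-a-Mole prize"

protected <- protect(txt)    # only hyphens between word characters are swapped
tokens <- restore(tokenize(protected))
tokens
## [1] "jack-o-lantern" "a"              "Whack-a-Mole"   "prize"
```

The free-standing hyphen between spaces is not protected, so the stand-in tokenizer still drops it, while the intra-word hyphens survive.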
