
gsub replace and preserve case

I've been using gsub to abbreviate words in longer strings. I'd like to abbreviate a word and have the abbreviation inherit as much of the input's capitalization as possible.

For example, turn hello into hi in this:

x <- c("Hello World", "HELLO WORLD", "hello world", "hElLo world")

But respect the case of hello in the original:

c("Hi World", "HI WORLD", "hi world", "hI world")

Most of the cases I really want to match are "HI", "hi" and "Hi". I don't care so much about "hI", but I leave it in as a possibility for completeness.

Until now, I have used the tedious approach of maintaining vectors of target and replacement strings:

xin <- c("Hello ", "HELLO ", "hello ", "hElLo ")
xout <- c("Hi ", "HI ", "hi ", "hI ")
mapply(gsub, xin, xout, x)

That gives the correct answer:

     Hello      HELLO      hello      hElLo
"Hi World" "HI WORLD" "hi world" "hI world"

But this is embarrassing, time consuming and inflexible! So far, I have a family of 50 words that need abbreviating, and keeping track of all the case combinations is tiresome.

The data is full of mixed-case chaos because humans typed in about 78000 records, and they capitalized words like department and university in every conceivable way. The long sentences they typed don't fit in the space allowed on the printed page, and we are asked to shorten those words to "dept" and "univ". We want to preserve the capitalization if possible.

The only idea I have doesn't look much like R to me: split the original input and tabulate the existing capitalization of the first 2 letters.

xcap <- sapply(strsplit(x, split = ""), function(x) x %in% LETTERS)[1:2, ]
> t(xcap)
      [,1]  [,2]
[1,]  TRUE FALSE
[2,]  TRUE  TRUE
[3,] FALSE FALSE
[4,] FALSE  TRUE

I'm pretty sure I could use that capitalization information to make this work right, but I haven't succeeded yet. I've just become aware of G Grothendieck's package gsubfn, which might work, but the terminology there ("proto" objects) is new to me.

I'll probably keep going in that direction, but I'm asking now whether there is a more direct route.

pj

Your idea inspired me to write this code. It's done in one sapply block. The toupper function is used to capitalize the split characters of the xout string.

x <- c("Hello World", "HELLO WORLD", "hello world", "hElLo world")

sapply(x, function(x,xout) {
  xcap<-(unlist(strsplit(unlist(strsplit(x," "))[1],"")) %in% LETTERS)
  n<-nchar(xout)
  if(length(xcap)>=n) {
   xcap<-xcap[1:n]
  }else {
    xcap<-c(xcap,rep(tail(xcap,1),n-length(xcap)))
    }
  xout<-paste(sapply(1:n,function(x) {
    if(xcap[x]) toupper(unlist(strsplit(xout,""))[x])
    else unlist(strsplit(xout,""))[x]
    }),sep = "",collapse = "")
  xin<-"hello"
  gsub(xin,xout,x[1],ignore.case = T)
  },xout="selamlar")

[output with "selamlar"]
 Hello World      HELLO WORLD      hello world      hElLo world 
"Selamlar World" "SELAMLAR WORLD" "selamlar world" "sElAmlar world" 

[output with "hi"]
Hello World HELLO WORLD hello world hElLo world 
"Hi World"  "HI WORLD"  "hi world"  "hI world" 

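The case-transfer step in the block above can also be pulled out on its own, which makes it easier to test. The following is only a sketch: apply_case_pattern is a hypothetical name that does not appear in the answer, and like the original it assumes ASCII letters (it checks against LETTERS).

```r
# Hypothetical helper (not part of the original answer): copy the
# upper/lower pattern of `src` onto `repl`, character by character.
apply_case_pattern <- function(src, repl) {
  src_chars  <- strsplit(src, "")[[1]]
  repl_chars <- strsplit(repl, "")[[1]]
  # TRUE where the source character is an upper-case ASCII letter
  is_upper <- src_chars %in% LETTERS
  n <- length(repl_chars)
  if (length(is_upper) >= n) {
    # Source is at least as long as the replacement: truncate the pattern
    is_upper <- is_upper[seq_len(n)]
  } else {
    # Replacement is longer: repeat the case of the last source character
    is_upper <- c(is_upper, rep(tail(is_upper, 1), n - length(is_upper)))
  }
  # Upper-case the flagged positions; other characters are left unchanged
  repl_chars[is_upper] <- toupper(repl_chars[is_upper])
  paste(repl_chars, collapse = "")
}

apply_case_pattern("hElLo", "hi")        # "hI"
apply_case_pattern("HELLO", "selamlar")  # "SELAMLAR"
apply_case_pattern("Hello", "selamlar")  # "Selamlar"
```

Splitting each string once and indexing with a logical vector avoids calling strsplit inside the inner loop for every character.
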
I tried to post this as a comment on the answer above, but it exceeded the length limit. OK to start a new answer?

Here's the solution we are using. It takes the idea that @vck proposed and wraps it in some functions that clean up the input and output. This still feels a bit kludgey to me, but the top priority was getting something that works in a way we can understand. The gsubfn-based avenues did not.

##' abbreviate words within strings, but preserve case of input
##'
##' Problem described at
##' http://stackoverflow.com/questions/32304688/gsub-replace-and-preserve-case
##' Please notify me of examples that fail
##' @param y vector of strings in which words are to be abbreviated
##' @param old target words to be abbreviated (regular expressions
##' are tolerated)
##' @param new replacements for target words.  must match old
##' vector length.
##' @return vector of abbreviated words 
##' @author Paul Johnson <pauljohn@@ku.edu>
stabbr <- function(y = NULL, old = NULL, new = NULL){
    stopifnot(length(old) == length(new))
    transfwrap <- function(xxin, xxout, xx){
        sapply(xx, transf, xin = xxin, xout = xxout)
    }

    transf <- function(x, xin, xout) {
        xin <- tolower(xin)
        xcap <- (unlist(strsplit(unlist(strsplit(x," "))[1],"")) %in% LETTERS)
        n <- nchar(xout)
        if(length(xcap) >= n) {
            xcap<-xcap[1:n]
        } else {
            xcap <- c(xcap, rep(tail(xcap,1), n-length(xcap)))
        }
        xout2 <- paste(sapply(1:n,function(x) {
            if (xcap[x]) toupper(unlist(strsplit(xout,""))[x])
            else unlist(strsplit(xout,""))[x]
        }), sep = "", collapse = "")
        gsub(xin, xout2, x[1], ignore.case = T)
    }

    for (i in seq_along(old)){
        y <- transfwrap(old[i], new[i], y)
    }
    y
}

Example usages:

x <- c("Hello World", "HELLO WORLD", "hello world", "hElLo world")
xin <- c("Hello", "world")
xout <- c("hi", "wrld")
stabbr(x, xin, xout)

## Hello World HELLO WORLD hello world hElLo world 
##   "Hi Wrld"   "HI WRLD"   "hi wrld"   "hI wRLD" 
x <- c("Department of Ornithology", "DEPARTMENT of ORNITHOLOGY",
       "Dept of Ornith")
xin <- c("Department", "Ornithology")
xout <- c("Dept", "Orni")
res <- stabbr(x, xin, xout)
cbind(x, res)

##                           x                           res
## Department of Ornithology "Department of Ornithology" "Dept of Orni"
## DEPARTMENT of ORNITHOLOGY "DEPARTMENT of ORNITHOLOGY" "DEPT of ORNI"
## Dept of Ornith            "Dept of Ornith"            "Dept of Ornith"

## Tolerates regular expressions.
## Suppose you want to change Department only at first word?
x <- c("Department of Ornithology", "DEPARTMENT of ORNITHOLOGY",
       "Dept of Ornith", "Ornithology Department")
## Aiming here for Department only as first word
xin <- c("^Department", " Ornithology")
xout <- c("Dept", " Orni")
res <- stabbr(x, xin, xout)
res

There is a nice side effect of this approach: the output is a named vector that uses the input strings as names.

##    Department of Ornithology DEPARTMENT of ORNITHOLOGY  
##           "Dept of Orni"            "DEPT of ORNI" 
##
##           Dept of Ornith    Ornithology Department 
##          "Dept of Ornith"  "Ornithology Department" 
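Note that stabbr reads the capitalization from the first word of each string, which is why "hElLo world" comes out as "hI wRLD" in the first example. If the case of each matched word itself should drive the result, one possible variation uses base R's gregexpr and regmatches<-. This is a different technique from the one in the answer, offered only as a sketch: abbr_match_case and copy_case are hypothetical names, ASCII letters are assumed, and unlike stabbr it returns an unnamed vector when the input is unnamed.

```r
# Hypothetical variant (not the answer's method): take the case pattern
# from each matched substring rather than from the first word of the string.
abbr_match_case <- function(x, old, new) {
  stopifnot(length(old) == length(new))

  # Copy the upper/lower pattern of `src` onto `repl`
  copy_case <- function(src, repl) {
    s <- strsplit(src, "")[[1]]
    r <- strsplit(repl, "")[[1]]
    up <- s %in% LETTERS
    if (!length(up)) return(repl)  # zero-length match: nothing to copy
    # Pad with the last flag when the replacement is longer than the match,
    # then truncate to the replacement length
    up <- c(up, rep(tail(up, 1), max(0, length(r) - length(up))))[seq_along(r)]
    r[up] <- toupper(r[up])
    paste(r, collapse = "")
  }

  for (i in seq_along(old)) {
    # Locate every case-insensitive match of the i-th target
    m <- gregexpr(old[i], x, ignore.case = TRUE)
    # Replace each match with a case-adjusted copy of the abbreviation
    regmatches(x, m) <- lapply(regmatches(x, m), function(hits)
      vapply(hits, copy_case, character(1), repl = new[i], USE.NAMES = FALSE))
  }
  x
}

x <- c("Hello World", "HELLO WORLD", "hello world", "hElLo world")
abbr_match_case(x, c("hello", "world"), c("hi", "wrld"))
```

With this input the last element should come out as "hI wrld" rather than "hI wRLD", because "world" is lower-case in that string.
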
