
Remove/replace specific words or phrases from character strings - R

I looked around both here and elsewhere and found many similar questions, but none that exactly answers mine. I need to clean up naming conventions, specifically to replace or remove certain words and phrases in one specific column/variable, not the entire dataset. I am migrating from SPSS to R. The SPSS code that does this is shown below, but I am not sure how to do it in R.

E.g.:

"Acadia Parish" --> "Acadia" (removes "Parish" and the space before it)

"Fifth District" --> "Fifth" (removes "District" and the space before it)

SPSS syntax:

COMPUTE county=REPLACE(county,' Parish','').

There are only a few kinds of this issue in a column with 32,000 cases. What needs replacing or removing varies, and the cases can repeat (there are dozens of instances of a phrase containing 'Parish'), so it is much faster to code exactly what needs to be removed or replaced. It is not as simple or clean as a regular expression that removes all spaces, all characters after a specific word or character, all special characters, etc. The match must also be able to include leading spaces.

I have looked at replace(), gsub() and other similar functions in R, but they all involve creating vectors, or at least seem to. What I'd like is syntax that looks for the characters I specify, which can include leading or trailing spaces, and replaces them with something I specify, which can be nothing at all. If the specified characters are not found, the case is left unchanged.

Yes, I will end up repeating the same syntax many times, and it's probably easier to create a vector, but if possible I'd like the syntax I described, as there are other similar operations I need to do as well.

Thank you for looking.

Maybe I'm missing something, but I don't see why you can't simply use alternation (|) in your regular expression and then trim off the leftover whitespace.

string <- c("Acadia Parish", "Fifth District")

bad_words <- c("Parish", "District") # Write all the words you want removed here!
bad_regex <- paste(bad_words, collapse = "|") # "Parish|District"

# Drop the first matching word, then trim the leftover whitespace
trimws(sub(bad_regex, "", string))

# [1] "Acadia" "Fifth"
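
One possible refinement of the approach above, offered as a sketch rather than as part of the original answer: the leading whitespace can be folded into the pattern itself, and word boundaries can keep the pattern from matching inside longer words. The use of `perl = TRUE` and the extra "Parishville" test value are assumptions added for illustration.

```r
# "Parishville" is a made-up value, included to show that partial words are left alone
string <- c("Acadia Parish", "Fifth District", "Parishville")

bad_words <- c("Parish", "District")
# "\\s*" absorbs any whitespace before the word; "\\b" restricts matches to whole words
bad_regex <- paste0("\\s*\\b(", paste(bad_words, collapse = "|"), ")\\b")

# gsub() replaces every match in each string, not only the first one
cleaned <- gsub(bad_regex, "", string, perl = TRUE)
cleaned
```

Because the whitespace is removed as part of the match, no separate trimws() step should be needed for words that follow a space.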

# Remove " Parish" (including the leading space) from one column; values without it are unchanged
dataframename$varname <- gsub(" Parish","", dataframename$varname)

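
If several phrases need the same treatment, one way to avoid repeating that line by hand is to keep the phrases in a named vector and loop over it. This is a minimal sketch under stated assumptions: the data frame `df`, its column `county`, the `replacements` vector and the "Orleans" value are all hypothetical. Passing `fixed = TRUE` makes gsub() treat each phrase as literal text rather than a regular expression, which is closest to how SPSS REPLACE is used in the question.

```r
# Hypothetical data standing in for the real column
df <- data.frame(
  county = c("Acadia Parish", "Fifth District", "Orleans"),
  stringsAsFactors = FALSE
)

# Names are the literal text to find (leading spaces included);
# values are what to put in its place ("" removes the text)
replacements <- c(" Parish" = "", " District" = "")

for (pattern in names(replacements)) {
  # fixed = TRUE: match the text exactly, with no regex interpretation
  df$county <- gsub(pattern, replacements[[pattern]], df$county, fixed = TRUE)
}

df$county
```

Values that contain none of the phrases, such as "Orleans" here, pass through unchanged.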
> x <- c("Acadia Parish", "Fifth District")
> x2 <- gsub("^(\\w*).*$", "\\1", x)
> x2
[1] "Acadia" "Fifth"

Legend:

  • ^ Start of the string.
  • () Capture group.
  • \\w* Zero or more word characters (letters, digits or underscore).
  • .* Zero or more occurrences of any character (whether . also matches a newline depends on the regex engine in use).
  • $ End of the string.
  • \\1 In the replacement, inserts the text captured by the first group.
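
To make the pieces of that pattern concrete, here is a small sketch comparing it with a more targeted pattern. The third test string and the variable names `first_word` and `targeted` are assumptions added for illustration. The point is that the pattern keeps only the first word of each string, which may remove more than intended for multi-word names.

```r
# "East Baton Rouge Parish" is an added test value with a multi-word name
x <- c("Acadia Parish", "Fifth District", "East Baton Rouge Parish")

# ^(\\w*) captures the leading run of word characters; .*$ consumes the rest;
# "\\1" replaces the whole string with just the captured group
first_word <- gsub("^(\\w*).*$", "\\1", x)

# Alternative: remove only a trailing " Parish" or " District", leading space included
targeted <- sub(" (Parish|District)$", "", x)

first_word
targeted
```

With these inputs, `first_word` reduces the third value to "East", while `targeted` keeps "East Baton Rouge".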

Note: technical posts on this site are licensed under CC BY-SA 4.0. If you repost this content, please cite this site's URL or the original source.