Remove certain words in string from column in dataframe in R

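I have a dataset in R that lists a bunch of company names, and as part of a clean-up I want to remove words such as "Inc", "Co." and "LLC". I have the following sample data: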

Sample data

  Location             Company
1 New York, NY         XYZ Company
2 Chicago, IL          Consulting Firm LLC
3 Miami, FL            Smith & Co.

Words I don't want in the output:

stopwords = c("Inc","inc","co","Co","Inc.","Co.","LLC","Corporation","Corp","&")

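I built the following function to split each string into words, remove the stop words, and paste the remaining words back together, but it does not iterate over each row of the dataset.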

removeWords <- function(str, stopwords) {
  x <- unlist(strsplit(str, " "))
  paste(x[!x %in% stopwords], collapse = " ")
}

removeWords(sampleData$Company,stopwords)

The output of the function above looks like this:

[1] "XYZ Company Consulting Firm Smith"

The output should be:

 Location              Company
1 New York, NY         XYZ Company
2 Chicago, IL          Consulting Firm
3 Miami, FL            Smith

Any help would be appreciated.
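The single-string result comes from `unlist(strsplit(...))`: `strsplit` returns one vector of words per row, and `unlist` flattens them all into one vector before `paste` collapses it. A minimal base R sketch that keeps the rows separate is to clean one string at a time and apply that over the column. The helper name `removeWordsOne` is made up for illustration, and the data frame is rebuilt from the sample in the question.

```r
# Sample data from the question
sampleData <- data.frame(
  Location = c("New York, NY", "Chicago, IL", "Miami, FL"),
  Company  = c("XYZ Company", "Consulting Firm LLC", "Smith & Co."),
  stringsAsFactors = FALSE
)
stopwords <- c("Inc", "inc", "co", "Co", "Inc.", "Co.", "LLC",
               "Corporation", "Corp", "&")

# Hypothetical helper: clean a single string
removeWordsOne <- function(s, stopwords) {
  x <- strsplit(s, " ", fixed = TRUE)[[1]]     # words of this one string
  paste(x[!x %in% stopwords], collapse = " ")  # drop stop words, rejoin
}

# Apply the helper to every row; vapply guarantees one string per row
sampleData$Company <- vapply(sampleData$Company, removeWordsOne, character(1),
                             stopwords = stopwords, USE.NAMES = FALSE)
print(sampleData)
```

Because the comparison uses `%in%` on whole words, "Co" does not touch "Company", and no regular-expression escaping is needed.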

We can use the 'tm' package:

library(tm)

stopwords <- readLines('stopwords.txt')   # your stop words file
x <- df$Company                           # company column data (note the capital C)
x <- removeWords(x, stopwords)            # remove stop words from every element

df$company_new <- x                       # add the result as a new column and check

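Alternatively, with a few tweaks to the stop words (escaping the dot in "Co." with "\\" so it is not treated as a regex wildcard, and adding spaces to the short words so they do not match inside longer words such as "Company"), gsub works too. The 'tm' answer above is preferable if you don't want to hand-tune the stop words: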

 stopwords = c("Inc","inc","co ","Co ","Inc."," Co\\.","LLC","Corporation","Corp","&")

 gsub(paste0(stopwords,collapse = "|"),"", df$Company)
[1] "XYZ Company"      "Consulting Firm " "Smith "       

df$Company <- gsub(paste0(stopwords,collapse = "|"),"", df$Company)
# df
#      Location          Company
#1 New York, NY      XYZ Company
#2  Chicago, IL Consulting Firm 
#3    Miami, FL           Smith 
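The gsub result above still carries trailing spaces, and an unescaped dot (as in "Inc.") matches any character. As a hedged sketch, one way around both is to quote every stop word literally with PCRE's `\Q...\E`, match it only as a whole space-delimited token, and then tidy the whitespace. The variable names and the rebuilt sample data are for illustration only.

```r
# Sample data from the question
df <- data.frame(
  Location = c("New York, NY", "Chicago, IL", "Miami, FL"),
  Company  = c("XYZ Company", "Consulting Firm LLC", "Smith & Co."),
  stringsAsFactors = FALSE
)
stopwords <- c("Inc", "inc", "co", "Co", "Inc.", "Co.", "LLC",
               "Corporation", "Corp", "&")

# \Q...\E makes each word literal, so "." and "&" need no manual escaping
literal <- paste0("\\Q", stopwords, "\\E", collapse = "|")

# Match a stop word only when it is not glued to other non-space characters
pattern <- paste0("(?<!\\S)(?:", literal, ")(?!\\S)")

cleaned <- gsub(pattern, "", df$Company, perl = TRUE)

# Squeeze runs of whitespace left behind, then trim both ends
df$Company <- trimws(gsub("\\s+", " ", cleaned))
print(df)
```

With this pattern the original stop word list from the question can be used unchanged, without padding entries with spaces by hand.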
