Remove certain words in string from column in dataframe in R
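I have a dataset in R that lists a bunch of company names, and I want to remove words like "Inc", "Co", "LLC", etc. as part of a clean-up effort. I have the following sample data:
Sample data
  Location      Company
1 New York, NY  XYZ Company
2 Chicago, IL   Consulting Firm LLC
3 Miami, FL     Smith & Co.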
Words I do not want in the output:
stopwords = c("Inc","inc","co","Co","Inc.","Co.","LLC","Corporation","Corp","&")
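I built the following function to split each string into words, remove the stopwords, and paste the remaining words back together, but it does not iterate over each row of the dataset.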
removeWords <- function(str, stopwords) {
  x <- unlist(strsplit(str, " "))
  paste(x[!x %in% stopwords], collapse = " ")
}
removeWords(sampleData$Company,stopwords)
The output of the function above looks like this:
[1] "XYZ Company Consulting Firm Smith"
The output should be:
  Location      Company
1 New York, NY  XYZ Company
2 Chicago, IL   Consulting Firm
3 Miami, FL     Smith
Any help would be appreciated.
We can use the 'tm' package:
library(tm)
stopwords = readLines('stopwords.txt')  # your stopwords file, read one entry per line
x = df$Company                          # company column (R is case-sensitive: Company, not company)
x = removeWords(x, stopwords)           # remove the stopwords from each element
df$company_new <- x                     # add the cleaned vector as a new column and check
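For readers who would rather not depend on tm, the asker's own split/filter/paste idea can also be made to work row by row. The original function returned a single string because unlist(strsplit(...)) flattens the words of every row into one vector before paste(collapse = " ") joins them. A minimal base-R sketch, assuming the sample data is stored in a data frame called sampleData as in the question (the helper name removeWordsPerRow is hypothetical):

```r
# Sample data from the question, rebuilt here so the sketch is self-contained
sampleData <- data.frame(
  Location = c("New York, NY", "Chicago, IL", "Miami, FL"),
  Company  = c("XYZ Company", "Consulting Firm LLC", "Smith & Co."),
  stringsAsFactors = FALSE
)
stopwords <- c("Inc", "inc", "co", "Co", "Inc.", "Co.", "LLC", "Corporation", "Corp", "&")

# Hypothetical helper: process one element at a time instead of flattening all rows
removeWordsPerRow <- function(str, stopwords) {
  vapply(strsplit(str, " ", fixed = TRUE), function(words) {
    # keep only the words that are not exact matches of a stopword
    paste(words[!words %in% stopwords], collapse = " ")
  }, character(1))
}

sampleData$Company <- removeWordsPerRow(sampleData$Company, stopwords)
print(sampleData)
```

Because the comparison is an exact match on whole space-separated tokens, "Company" is left alone even though "Co" is a stopword.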
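With a small tweak to the stopwords (escaping the dot in " Co\\." so it is not treated as a regex wildcard, and adding spaces to some entries), gsub also works. If you would rather not hand-tune the stopwords, prefer the tm answer above: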
stopwords = c("Inc","inc","co ","Co ","Inc."," Co\\.","LLC","Corporation","Corp","&")
gsub(paste0(stopwords,collapse = "|"),"", df$Company)
[1] "XYZ Company" "Consulting Firm " "Smith "
df$Company <- gsub(paste0(stopwords,collapse = "|"),"", df$Company)
# df
# Location Company
#1 New York, NY XYZ Company
#2 Chicago, IL Consulting Firm
#3 Miami, FL Smith
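Note that the gsub result shown above still carries trailing spaces ("Consulting Firm ", "Smith "). One possible way to tidy that, sketched here under the assumption that df holds the question's sample data, is to collapse repeated spaces and then call base R's trimws(); the variable names pattern and cleaned are illustrative only:

```r
# Sample data from the question, rebuilt here so the sketch is self-contained
df <- data.frame(
  Location = c("New York, NY", "Chicago, IL", "Miami, FL"),
  Company  = c("XYZ Company", "Consulting Firm LLC", "Smith & Co."),
  stringsAsFactors = FALSE
)
stopwords <- c("Inc", "inc", "co ", "Co ", "Inc.", " Co\\.", "LLC", "Corporation", "Corp", "&")

# Build one alternation pattern from the stopwords, as in the answer above
pattern <- paste0(stopwords, collapse = "|")
cleaned <- gsub(pattern, "", df$Company)

# Collapse runs of spaces left behind by the removal, then strip leading/trailing whitespace
cleaned <- trimws(gsub(" +", " ", cleaned))

df$Company <- cleaned
print(df)
```

This keeps the regex approach of the answer and only adds a whitespace clean-up step afterwards.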