Remove specific words conditionally in R
I am trying to remove a list of words from sentences according to specific conditions.
Suppose we have this data frame:
responses <- c("The Himalaya", "The Americans", "A bird", "The Pacific ocean")
questions <- c("The highest mountain in the world","A cold war serie from 2013","A kiwi which is not a fruit", "Widest liquid area on earth")
df <- cbind(questions,responses)
> df
questions responses
[1,] "The highest mountain in the world" "The Himalaya"
[2,] "A cold war serie from 2013" "The Americans"
[3,] "A kiwi which is not a fruit" "A bird"
[4,] "Widest liquid area on earth" "The Pacific ocean"
And the following lists of specific words:
articles <- c("The","A")
geowords <- c("mountain","liquid area")
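I would like to do two things:
1. Remove the article in first position in the responses column when it is adjacent to a word beginning with a lowercase letter.
2. Remove the article in first position in the responses column when it is adjacent to a word beginning with an uppercase letter AND a geoword appears in the corresponding question.
The expected result should be: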
questions responses
[1,] "The highest mountain in the world" "Himalaya"
[2,] "A cold war serie from 2013" "The Americans"
[3,] "A kiwi which is not a fruit" "bird"
[4,] "Widest liquid area on earth" "Pacific ocean"
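My attempts with gsub failed, since I am not at all familiar with regex... I searched Stack Overflow and did not find a really similar question. If an R and regex all-star could help me, I would be very grateful!
As you described, this is written as two logical columns, with ifelse used for the check and gsub for the replacement: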
responses <- c("The Himalaya", "The Americans", "A bird", "The Pacific ocean")
questions <- c("The highest mountain in the world","A cold war serie from 2013","A kiwi which is not a fruit", "Widest liquid area on earth")
df <- data.frame(questions, responses, stringsAsFactors = FALSE)
df
articles <- c("The ","A ")
geowords <- c("mountain","liquid area")
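library(stringr)  # provides str_split()
# TRUE when the second word of the response starts with an uppercase letter
df$f_caps <- unlist(lapply(df$responses, function(x) {
  grepl('[A-Z]', str_split(str_split(x, ' ', simplify = TRUE)[2], '', simplify = TRUE)[1])
}))
# TRUE when the question contains one of the geowords
df$geoword_flag <- grepl(paste(geowords, collapse = '|'), df[, 1])
# Strip the leading article for lowercase words, or for uppercase words with a geoword
df$new_responses <- ifelse((df$f_caps & df$geoword_flag) | !df$f_caps,
                           gsub(paste0('^(', paste(articles, collapse = '|'), ')'), '', df$responses),
                           df$responses)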
df$new_responses
> df$new_responses
[1] "Himalaya" "The Americans" "bird" "Pacific ocean"
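The f_caps flag can also be computed without splitting the strings. This is only a minimal sketch, assuming the article is always the first whitespace-delimited word of the response; the POSIX class `[[:upper:]]` is used here instead of `[A-Z]` as a locale-safe choice, which is an assumption of this sketch rather than part of the answer above.

```r
responses <- c("The Himalaya", "The Americans", "A bird", "The Pacific ocean")

# TRUE when the word right after the first word starts with an uppercase letter
f_caps <- grepl("^\\S+\\s+[[:upper:]]", responses)
print(f_caps)
```

Because grepl is vectorised, no lapply or unlist is needed.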
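For fun, here is a tidyverse solution:
library(dplyr)    # mutate(), if_else(), as_tibble(), %>%
library(stringr)  # str_detect(), str_replace(), str_c()
df2 <-
  df %>%
  as_tibble() %>%
  mutate(responses =
    # Does the question contain any of the geowords?
    if_else(str_detect(questions, str_c(geowords, collapse = "|")),
      # Geoword present: drop the first word when an uppercase letter follows
      str_replace(string = responses,
                  pattern = regex("^\\w+\\b\\s(?=[A-Z])"),
                  replacement = ""),
      # No geoword: drop the first word when a lowercase letter follows
      str_replace(string = responses,
                  pattern = regex("^\\w+\\b\\s(?=[a-z])"),
                  replacement = ""))
  )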
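One detail worth checking when testing the questions against several geowords: str_detect is vectorised over its pattern argument, so passing a two-element vector compares patterns and strings element by element rather than asking "does any geoword occur". As far as I know, recent stringr versions reject such mismatched lengths outright. A hedged sketch of the "any geoword" test, collapsing the words into a single alternation:

```r
library(stringr)

questions <- c("The highest mountain in the world", "A cold war serie from 2013",
               "A kiwi which is not a fruit", "Widest liquid area on earth")
geowords <- c("mountain", "liquid area")

# One pattern, "mountain|liquid area", tested against every question
geo_pattern <- str_c(geowords, collapse = "|")
has_geo <- str_detect(questions, geo_pattern)
print(has_geo)
```

This assumes the geowords contain no regex metacharacters; otherwise they would need escaping first.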
Edit: a version without the "first word" regex, inspired by @Calvin Taylor:
library(dplyr)    # mutate(), if_else(), as_tibble(), %>%
library(stringr)  # str_detect(), str_replace_all(), str_c()
# Define articles
articles <- c("The", "A")
# Make it a regex alternation
art_or <- paste0(articles, collapse = "|")
# Leading article before an uppercase / lowercase letter
art_upper <- paste0("^(?:", art_or, ")", "\\s", "(?=[A-Z])")
art_lower <- paste0("^(?:", art_or, ")", "\\s", "(?=[a-z])")
# Work on df
df4 <-
  df %>%
  as_tibble() %>%
  mutate(responses =
    # Collapse geowords so that any of them counts as a match
    if_else(str_detect(questions, str_c(geowords, collapse = "|")),
      str_replace_all(string = responses,
                      pattern = regex(art_upper),
                      replacement = ""),
      str_replace_all(string = responses,
                      pattern = regex(art_lower),
                      replacement = "")
    )
  )
I taught myself a little R today. I used a function to get the same result.
#!/usr/bin/env Rscript
# References
# https://stackoverflow.com/questions/1699046/for-each-row-in-an-r-dataframe
responses <- c("The Himalaya", "The Americans", "A bird", "The Pacific ocean")
questions <- c("The highest mountain in the world","A cold war serie from 2013","A kiwi which is not a fruit", "Widest liquid area on earth")
df <- cbind(questions,responses)
articles <- c("The","A")
geowords <- c("mountain","liquid area")
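# Build "^(?:The |A )": a leading article followed by a space
common_pattern <- paste0("^(?:", paste0(articles, " ", collapse = "|"), ")")
pattern1 <- paste0(common_pattern, "([a-z])")  # article before a lowercase letter
pattern2 <- paste0(common_pattern, "([A-Z])")  # article before an uppercase letter
geo_pattern <- paste(geowords, collapse = "|")
f <- function(x) {
  q <- x[1]  # question
  r <- x[2]  # response
  # Rule 1: always drop the article before a lowercase word
  a1 <- gsub(pattern1, "\\1", r, perl = TRUE)
  # Rule 2: drop it before an uppercase word only if the question has a geoword
  if (grepl(geo_pattern, q)) {
    a1 <- gsub(pattern2, "\\1", a1, perl = TRUE)
  }
  a1  # return the cleaned response
}
apply(df, 1, f)
Running it: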
Rscript stacko.R
[1] "Himalaya" "The Americans" "bird" "Pacific ocean"
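The same two rules can also be expressed without a row-wise apply. The following is only a sketch: strip_article is a hypothetical helper name that does not appear in any of the answers, and it assumes the articles and geowords contain no regex metacharacters.

```r
# Hypothetical helper (not from the original answers), vectorised over rows
strip_article <- function(questions, responses, articles, geowords) {
  # Leading article followed by whitespace, e.g. "^(?:The|A)\\s+"
  art <- paste0("^(?:", paste(articles, collapse = "|"), ")\\s+")
  has_geo <- grepl(paste(geowords, collapse = "|"), questions)
  before_lower <- grepl(paste0(art, "[[:lower:]]"), responses, perl = TRUE)
  before_upper <- grepl(paste0(art, "[[:upper:]]"), responses, perl = TRUE)
  # Rule 1: lowercase word follows; Rule 2: uppercase word follows and geoword present
  drop <- before_lower | (before_upper & has_geo)
  ifelse(drop, sub(art, "", responses, perl = TRUE), responses)
}

responses <- c("The Himalaya", "The Americans", "A bird", "The Pacific ocean")
questions <- c("The highest mountain in the world", "A cold war serie from 2013",
               "A kiwi which is not a fruit", "Widest liquid area on earth")
result <- strip_article(questions, responses,
                        articles = c("The", "A"),
                        geowords = c("mountain", "liquid area"))
print(result)
```

Keeping the two conditions as separate logical vectors makes it easy to inspect which rule fired for a given row.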
You can also use simple regular expressions with grepl and gsub, as follows:
# Convert to a data frame (cbind returns a matrix); stringsAsFactors = FALSE keeps the columns as character
df <- data.frame(questions, responses, stringsAsFactors = FALSE)
regx <- paste0(geowords, collapse = "|")         # "or" condition between the geowords
articlegrep <- paste0(articles, collapse = "|")  # "or" condition between the articles
# Replace when the question has a geoword, or when a leading article precedes a lowercase word
df$responses <- ifelse(grepl(regx, df$questions) | grepl(paste0("^(", articlegrep, ")\\s[a-z]"), df$responses),
                       gsub("\\w+ (.*)", "\\1", df$responses), df$responses)
> print(df)
questions responses
#1 The highest mountain in the world Himalaya
#2 A cold war serie from 2013 The Americans
#3 A kiwi which is not a fruit bird
#4 Widest liquid area on earth Pacific ocean
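One caveat about the replacement pattern "\\w+ (.*)": it drops whatever the first word is, not only an article. With the sample data this makes no difference, but a response such as "Mount Everest" (a made-up value, not in the question's data) would lose its first word. A small sketch comparing it with a stricter pattern, which is an assumption of mine rather than part of the answer:

```r
# Pattern from the answer above: removes the first word, whatever it is
first_word_removed <- gsub("\\w+ (.*)", "\\1", "Mount Everest")
print(first_word_removed)

# Hypothetical stricter variant: removes only a leading "The" or "A"
article_removed <- sub("^(The|A)\\s+", "", c("Mount Everest", "The Pacific ocean"))
print(article_removed)
```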