Conditionally remove specific words in R

Question

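I am trying to remove a list of words from sentences according to specific conditions.

Suppose we have this data frame (strictly, a character matrix, since it is built with cbind):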

responses <- c("The Himalaya", "The Americans", "A bird", "The Pacific ocean")
questions <- c("The highest mountain in the world","A cold war serie from 2013","A kiwi which is not a fruit", "Widest liquid area on earth")
df <- cbind(questions,responses)

> df
     questions                           responses          
[1,] "The highest mountain in the world" "The Himalaya"     
[2,] "A cold war serie from 2013"        "The Americans"    
[3,] "A kiwi which is not a fruit"       "A bird"           
[4,] "Widest liquid area on earth"       "The Pacific ocean"

And the following lists of specific words:

articles <- c("The","A")
geowords <- c("mountain","liquid area")

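I want to do two things:

Remove the article in first position in the "responses" column when it is adjacent to a word starting with a lowercase letter.
Remove the article in first position in the "responses" column when it is adjacent to a word starting with an uppercase letter AND IF the corresponding question contains a geoword.

The expected result should be: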

     questions                           responses      
[1,] "The highest mountain in the world" "Himalaya"     
[2,] "A cold war serie from 2013"        "The Americans"
[3,] "A kiwi which is not a fruit"       "bird"         
[4,] "Widest liquid area on earth"       "Pacific ocean"

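My attempts with gsub failed because I am not at all familiar with regex... I searched Stack Overflow and did not find a really similar question. If an R and regex all-star could help me, I would be very grateful!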

Answer 1

Following the logic you describe, this builds two logical columns, then uses ifelse to check them and gsub to do the replacement:

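library(stringr)  # str_split() is used below

responses <- c("The Himalaya", "The Americans", "A bird", "The Pacific ocean")
questions <- c("The highest mountain in the world","A cold war serie from 2013","A kiwi which is not a fruit", "Widest liquid area on earth")
# cbind() returns a matrix, so convert it to a data frame with character columns
df <- data.frame(cbind(questions, responses), stringsAsFactors = FALSE)

df

# Trailing space so that the article is removed together with its separator
articles <- c("The ", "A ")
geowords <- c("mountain", "liquid area")

# TRUE when the second word of the response starts with an uppercase letter
df$f_caps <- unlist(lapply(df$responses, function(x) {
  second_word <- str_split(x, " ", simplify = TRUE)[2]
  grepl("[A-Z]", str_split(second_word, "", simplify = TRUE)[1])
}))

# TRUE when the question contains any of the geowords
df$geoword_flag <- grepl(paste(geowords, collapse = "|"), df[, 1])

# Drop the leading article when the next word is lowercase,
# or when it is uppercase and the question contains a geoword
df$new_responses <- ifelse((df$f_caps & df$geoword_flag) | !df$f_caps,
                     gsub(paste0("^(", paste(articles, collapse = "|"), ")"), "", df$responses),
                     df$responses)

df$new_responses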


> df$new_responses
[1] "Himalaya"      "The Americans" "bird"          "Pacific ocean"
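
For comparison, here is a minimal base R sketch of the same two rules with no stringr dependency. The helper name strip_article and its arguments are hypothetical and do not come from any of the answers. It assumes the article is always the first word and is separated from the next word by a single space.

```r
# Hypothetical helper: applies the two rules from the question in one pass
strip_article <- function(questions, responses,
                          articles = c("The", "A"),
                          geowords = c("mountain", "liquid area")) {
  art_or <- paste(articles, collapse = "|")
  # Leading article followed by a lowercase / uppercase letter
  before_lower <- grepl(paste0("^(", art_or, ") [a-z]"), responses, perl = TRUE)
  before_upper <- grepl(paste0("^(", art_or, ") [A-Z]"), responses, perl = TRUE)
  # Question mentions at least one geoword
  has_geo <- grepl(paste(geowords, collapse = "|"), questions)
  # Rule 1 always applies; rule 2 only applies to geo questions
  drop <- before_lower | (before_upper & has_geo)
  ifelse(drop, sub(paste0("^(", art_or, ") "), "", responses), responses)
}

questions <- c("The highest mountain in the world", "A cold war serie from 2013",
               "A kiwi which is not a fruit", "Widest liquid area on earth")
responses <- c("The Himalaya", "The Americans", "A bird", "The Pacific ocean")

result <- strip_article(questions, responses)
print(result)
```

Anchoring the patterns with ^ restricts the match to an article in first position, which is what the question asks for.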

Answer 2

Just for fun, here is a tidyverse solution:

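library(dplyr)    # %>%, mutate(), if_else(), as_tibble()
library(stringr)  # str_detect(), str_replace()

# Collapse the geowords into a single alternation; passing the vector itself
# would recycle it against the questions element by element
geo_or <- paste(geowords, collapse = "|")

df2 <-
  df %>%
  as_tibble() %>%
  mutate(responses =
        # Does the question contain a geoword?
        if_else(str_detect(questions, geo_or),
                # Yes: drop the first word when it is followed by any letter
                str_replace(string = responses,
                            pattern = regex("^\\w+\\b\\s(?=[A-Za-z])"),
                            replacement = ""),
                # No: drop the first word only when followed by a lowercase letter
                str_replace(string = responses,
                            pattern = regex("^\\w+\\b\\s(?=[a-z])"),
                            replacement = ""))
        )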

Edit: a version without the "first word" regex, inspired by @Calvin Taylor

# Define articles
articles <- c("The", "A")

# Make it a regex alternation
art_or <- paste0(articles, collapse = "|")

# Leading article before any letter / before a lowercase letter
art_any   <- paste0("^(?:", art_or, ")", "\\s", "(?=[A-Za-z])")
art_lower <- paste0("^(?:", art_or, ")", "\\s", "(?=[a-z])")

# Collapse the geowords too, so str_detect() tests every geoword on every question
geo_or <- paste0(geowords, collapse = "|")

# Work on df
df4 <-
  df %>%
  as_tibble() %>%
  mutate(responses =
        if_else(str_detect(questions, geo_or),
                str_replace_all(string = responses,
                                pattern = regex(art_any),
                                replacement = ""),
                str_replace_all(string = responses,
                                pattern = regex(art_lower),
                                replacement = "")
                )
        )
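
Lookahead patterns like these are not specific to stringr. As a minimal sketch, base R accepts the same syntax as long as perl = TRUE is passed to sub. The variable names below are illustrative, and the patterns are anchored with ^ on the assumption that only a leading article should be removed.

```r
articles <- c("The", "A")
art_or <- paste(articles, collapse = "|")

# Leading article, one whitespace character, then a lookahead that
# checks the next letter without consuming it
art_upper <- paste0("^(?:", art_or, ")\\s(?=[A-Z])")
art_lower <- paste0("^(?:", art_or, ")\\s(?=[a-z])")

x <- c("The Himalaya", "A bird")

# perl = TRUE is required: the default regex engine has no lookahead
upper_removed <- sub(art_upper, "", x, perl = TRUE)
lower_removed <- sub(art_lower, "", x, perl = TRUE)

print(upper_removed)
print(lower_removed)
```

Because the lookahead does not consume the following letter, no capture group or backreference is needed in the replacement.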

Answer 3

I taught myself some R today. I used a function to get the same result.

#!/usr/bin/env Rscript

# References
# https://stackoverflow.com/questions/1699046/for-each-row-in-an-r-dataframe

responses <- c("The Himalaya", "The Americans", "A bird", "The Pacific ocean")
questions <- c("The highest mountain in the world","A cold war serie from 2013","A kiwi which is not a fruit", "Widest liquid area on earth")
df <- cbind(questions,responses)

articles <- c("The","A")
geowords <- c("mountain","liquid area")

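# "^(?:The |A )": an article in first position, followed by a space
common_pattern <- paste0("^(?:", paste0(articles, " ", collapse = "|"), ")")
pattern1 <- paste0(common_pattern, "([a-z])")  # article before a lowercase letter
pattern2 <- paste0(common_pattern, "([A-Z])")  # article before an uppercase letter
geo_pattern <- paste(geowords, collapse = "|")

# x is one row of df: x[1] is the question, x[2] is the response
f <- function(x) {
  q <- x[1]
  r <- x[2]
  # Rule 1: always drop an article that precedes a lowercase letter
  a1 <- gsub(pattern1, "\\1", r, perl = TRUE)
  # Rule 2: drop an article before an uppercase letter only for geo questions
  if (grepl(geo_pattern, q)) {
    a1 <- gsub(pattern2, "\\1", a1, perl = TRUE)
  }
  a1  # return the cleaned response
}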

apply (df, 1, f)

Running it:

Rscript stacko.R
[1] "Himalaya"      "The Americans" "bird"          "Pacific ocean"

Answer 4

Alternatively, you can use a simple regular expression with grepl and gsub, as follows:

df <- data.frame(cbind(questions, responses), stringsAsFactors = FALSE) # cbind gives a matrix, so convert to a data frame; stringsAsFactors = FALSE keeps the columns as character
regx <- paste0(geowords, collapse = "|") # The "or" condition between the geowords
articlegrep <- paste0(articles, collapse = "|") # The "or" condition between the articles
# Strip the first word when the question has a geoword, or when a leading article precedes a lowercase word
df$responses <- ifelse(grepl(regx, df$questions) | grepl(paste0("^(", articlegrep, ")\\s[a-z]"), df$responses),
       gsub("^\\w+ (.*)", "\\1", df$responses), df$responses)

> print(df)
                          questions     responses
#1 The highest mountain in the world      Himalaya
#2        A cold war serie from 2013 The Americans
#3       A kiwi which is not a fruit          bird
#4       Widest liquid area on earth Pacific ocean
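
One caveat on the gsub call above: the pattern \\w+ (.*) drops whatever word comes first, whether or not it is an article. The sketch below compares it with a pattern built from the articles vector. The response "Mount Everest" is a hypothetical example that does not appear in the question.

```r
articles <- c("The", "A")
# Leading article followed by a single space
art_first <- paste0("^(", paste(articles, collapse = "|"), ") ")

# The second response is hypothetical: it starts with a word that is not an article
x <- c("The Himalaya", "Mount Everest")

generic  <- gsub("\\w+ (.*)", "\\1", x)  # removes any first word
anchored <- sub(art_first, "", x)        # removes only a leading article

print(generic)
print(anchored)
```

If every response to a geo question is known to start with an article, the two give the same result. The anchored form is the safer choice when that is not guaranteed.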

4 answers

Answer 1: 0, accepted, 2017-12-01 11:59:22
Answer 2: 0, 2017-12-01 14:03:55
Answer 3: 0, 2017-12-01 23:04:33
Answer 4: 0, 2017-12-02 17:53:00
