
R: traverse through columns and keep only the rows that contain '&' or 'and'

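I have a data frame with multiple columns. Column A contains a number that repeats. Column B contains a name. I want to search through all rows and, for each set of equal values in Column A, keep only the rows whose Column B contains the '&' symbol or the word 'and'. If none of the entries for a given value contains either of these, I just want to keep one row, and it doesn't matter which. Sample data: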

Column A           Column B     
12345                John
12345                Mary and Bob
12345                Ben
44444                Jim
44444                Larry & Meg
55555                Tommy

Expected output:

Column A            Column B
12345               Mary and Bob
44444               Larry & Meg
55555               Tommy

You can use ave and grepl to get the matching rows:

dat[ave(dat$ColumnB, dat$ColumnA, FUN=function(x) {
  g <- grepl("( & )|( and )", x)
  if (all(!g)) {
    seq_along(x) == 1
  } else {
    g
  }
}) == "TRUE",]
#   ColumnA      ColumnB
# 2   12345 Mary and Bob
# 5   44444  Larry & Meg
# 6   55555        Tommy

Data:

dat = data.frame(ColumnA=c(12345, 12345, 12345, 44444, 44444, 55555), ColumnB=c("John", "Mary and Bob", "Ben", "Jim", "Larry & Meg", "Tommy"), stringsAsFactors=FALSE)
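
The comparison with the string "TRUE" in the code above is needed because ave() writes the logical results back into a copy of the character column, which coerces them to text. A minimal sketch of a variant that avoids this, assuming the same sample data (rebuilt here so the block runs on its own): compute the match vector first and let ave() group that logical vector instead. The names g and keep are illustrative only.

```r
dat <- data.frame(
  ColumnA = c(12345, 12345, 12345, 44444, 44444, 55555),
  ColumnB = c("John", "Mary and Bob", "Ben", "Jim", "Larry & Meg", "Tommy"),
  stringsAsFactors = FALSE
)

# TRUE where ColumnB contains " & " or " and "
g <- grepl("( & )|( and )", dat$ColumnB)

# Per ColumnA group: keep the matching rows, or the first row if nothing matches.
# Because g is logical, ave() returns a logical vector as well.
keep <- ave(g, dat$ColumnA, FUN = function(m) {
  if (any(m)) m else seq_along(m) == 1
})

dat[keep, ]
```

Since keep is already logical, it can be used directly as a row index without any string comparison.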

Try

library(data.table)
setDT(df1)[ , {tmp <- grepl('\\band\\b|&', ColumnB)
               .SD[tmp|all(!tmp)]}, ColumnA]
#   ColumnA      ColumnB
#1:   12345 Mary and Bob
#2:   44444  Larry & Meg
#3:   55555        Tommy

Or using dplyr

library(dplyr)
df1 %>% 
   group_by(ColumnA) %>% 
   mutate(tmp= grepl('\\band\\b|&', ColumnB)) %>% 
   filter(tmp|all(!tmp))%>%
   select(-tmp)

#  ColumnA      ColumnB
#1   12345 Mary and Bob
#2   44444  Larry & Meg
#3   55555        Tommy

Data

df1 <- structure(list(ColumnA = c(12345L, 12345L, 12345L, 44444L, 44444L, 
55555L), ColumnB = c("John", "Mary and Bob", "Ben", "Jim", "Larry & Meg", 
"Tommy")), .Names = c("ColumnA", "ColumnB"), class = "data.frame",
row.names = c(NA, -6L))
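
Note that the condition tmp | all(!tmp) keeps every row of a group in which nothing matches. That agrees with the expected output here only because the unmatched group (55555) has a single row. If exactly one row per unmatched group is wanted, as the question asks, one possible adjustment is sketched below. The data frame df2 and its 66666 group are made up for illustration and are not part of the original data.

```r
library(dplyr)

# Hypothetical data: group 66666 has two rows and no match
df2 <- data.frame(
  ColumnA = c(12345L, 12345L, 44444L, 66666L, 66666L),
  ColumnB = c("John", "Mary and Bob", "Larry & Meg", "Ann", "Sue"),
  stringsAsFactors = FALSE
)

res <- df2 %>%
  group_by(ColumnA) %>%
  mutate(tmp = grepl("\\band\\b|&", ColumnB)) %>%
  # Matching rows if the group has any, otherwise only the group's first row
  filter(if (any(tmp)) tmp else row_number() == 1L) %>%
  ungroup() %>%
  select(-tmp)

res
```

With the original condition, both "Ann" and "Sue" would be returned for 66666. This variant returns only the first of them.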

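You want to split the dataset into couples and singles, de-duplicate each by ID, and then return all the couples plus the singles whose ID has no couple.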

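# Reproducible example
df <- data.frame(a = c(rep(12345, 3), rep(44444, 2), 55555),
                 b = c("John", "Mary and Bob", "Ben", "Jim", "Larry & Meg", "Tommy"),
                 stringsAsFactors = FALSE)

# Logical index of the "couple" rows. Unlike df[-which(...), ], negating a
# logical vector still selects the right rows when nothing matches.
is_couple <- grepl("&| and ", df$b, ignore.case = TRUE)

# Keep the first couple row and the first single row for each ID
df_couples <- df[is_couple, ][!duplicated(df$a[is_couple]), ]
df_singles <- df[!is_couple, ][!duplicated(df$a[!is_couple]), ]

# All couples, plus the singles whose ID has no couple row
rbind(df_couples, df_singles[!df_singles$a %in% df_couples$a, ])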
# 
#       a            b
# 2 12345 Mary and Bob
# 5 44444  Larry & Meg
# 6 55555        Tommy
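
The answers use three slightly different patterns, and they do not agree on every input. A small sketch of where they differ, using made-up names in x (none of these strings come from the question):

```r
# Hypothetical test strings
x <- c("Mary and Bob", "Sandy", "Tom&Ann", "Larry & Meg", "Bob AND Sue")

# Requires spaces around "&" and around "and"
p_spaces <- grepl("( & )|( and )", x)

# Whole word "and" (case-sensitive) or any "&"
p_word <- grepl("\\band\\b|&", x)

# Any "&", or " and " surrounded by spaces, ignoring case
p_nocase <- grepl("&| and ", x, ignore.case = TRUE)

data.frame(x, p_spaces, p_word, p_nocase)
```

None of the three matches "Sandy". Only the space-delimited pattern misses "Tom&Ann", and only the case-insensitive one matches "Bob AND Sue". Which behavior is right depends on how the names in Column B are actually written.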
