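R: traverse through columns and keep only rows that contain '&' or 'and'
I have a data frame with multiple columns. Column A contains a number that repeats. Column B contains a name. I want to search all rows and, for each group of equal values in column A, keep only the rows whose column B contains the '&' symbol or the word 'and'. If none of the entries in a group has either of these, I just want to keep one row, and it does not matter which. Sample data: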
Column A Column B
12345 John
12345 Mary and Bob
12345 Ben
44444 Jim
44444 Larry & Meg
55555 Tommy
Expected output:
Column A Column B
12345 Mary and Bob
44444 Larry & Meg
55555 Tommy
You can use ave and grepl to get the matching rows:
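dat[ave(dat$ColumnB, dat$ColumnA, FUN=function(x) {
  # TRUE for names containing " & " or " and " (with surrounding spaces)
  g <- grepl("( & )|( and )", x)
  if (all(!g)) {
    # no match in this group: keep only its first row
    seq_along(x) == 1
  } else {
    g
  }
# ave() writes the flags back into the character column, hence "TRUE"
}) == "TRUE",]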
# ColumnA ColumnB
# 2 12345 Mary and Bob
# 5 44444 Larry & Meg
# 6 55555 Tommy
Data:
dat = data.frame(ColumnA=c(12345, 12345, 12345, 44444, 44444, 55555), ColumnB=c("John", "Mary and Bob", "Ben", "Jim", "Larry & Meg", "Tommy"), stringsAsFactors=FALSE)
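The comparison with the string "TRUE" in the ave call above may look odd. A minimal sketch of why it is needed, assuming a toy vector `x` and grouping vector `g` that are not part of the original answer: ave writes the result of FUN back into a copy of its first argument, so logical flags stored into a character vector come back as text.

```r
# Toy inputs (hypothetical, for illustration only)
x <- c("John", "Mary and Bob", "Ben")
g <- c(1, 1, 2)

# FUN returns logicals, but ave() assigns them into the character vector x,
# so they are coerced to the strings "TRUE" / "FALSE"
flags <- ave(x, g, FUN = function(v) grepl(" and ", v))
class(flags)

# Converting back gives a proper logical index
keep <- as.logical(flags)
x[keep]
```

Using `as.logical()` instead of `== "TRUE"` is only a stylistic alternative; both give the same row selection here.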
Try
library(data.table)
setDT(df1)[ , {tmp <- grepl('\\band\\b|&', ColumnB)
               # keep the matching rows; if the group has none, keep only its first row
               .SD[if (any(tmp)) tmp else 1L]}, ColumnA]
# ColumnA ColumnB
#1: 12345 Mary and Bob
#2: 44444 Larry & Meg
#3: 55555 Tommy
Or using dplyr
library(dplyr)
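df1 %>%
  group_by(ColumnA) %>%
  mutate(tmp = grepl('\\band\\b|&', ColumnB)) %>%
  # keep the matching rows; if the group has none, keep only its first row
  filter(if (any(tmp)) tmp else row_number() == 1) %>%
  select(-tmp)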
# ColumnA ColumnB
#1 12345 Mary and Bob
#2 44444 Larry & Meg
#3 55555 Tommy
Data:
df1 <- structure(list(ColumnA = c(12345L, 12345L, 12345L, 44444L, 44444L,
55555L), ColumnB = c("John", "Mary and Bob", "Ben", "Jim", "Larry & Meg",
"Tommy")), .Names = c("ColumnA", "ColumnB"), class = "data.frame",
row.names = c(NA, -6L))
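The answers on this page use two different patterns, "( & )|( and )" and '\\band\\b|&', which agree on the sample data but not in general. A small sketch of the difference, using made-up names (`names_to_test` is hypothetical and not from the question):

```r
# Hypothetical names chosen to expose the edge cases
names_to_test <- c("Mary and Bob", "Larry & Meg", "Sandy", "Tom&Jerry")

# Requires a space on both sides of "&" or "and"
spaced <- grepl("( & )|( and )", names_to_test)

# Matches "and" as a whole word, and "&" anywhere
bounded <- grepl("\\band\\b|&", names_to_test)

data.frame(name = names_to_test, spaced = spaced, bounded = bounded)
```

Neither pattern matches "Sandy", since the "and" there is inside a word. Only the word-boundary pattern matches "Tom&Jerry", because the first pattern insists on spaces around the ampersand. Which behaviour is wanted depends on how the names in column B are actually written.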
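You want to split the data set into couples and singles, deduplicate each on ID, and then return all the couples plus the singles whose ID has no couple.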
# Reproducible example
df <- data.frame(a=c(rep(12345,3),rep(44444,2),55555),
                 b=c("John","Mary and Bob","Ben","Jim","Larry & Meg","Tommy")
)
# Logical index of couple rows; unlike -which(...), !is_couple still works when nothing matches
is_couple <- grepl("&| and ", df$b, ignore.case=TRUE)
df_couples <- df[is_couple,][!duplicated(df$a[is_couple]),]
df_singles <- df[!is_couple,][!duplicated(df$a[!is_couple]),]
# Couples first, then the singles whose ID has no couple row
rbind(df_couples, df_singles[!df_singles$a %in% df_couples$a,])
#
# a b
# 2 12345 Mary and Bob
# 5 44444 Larry & Meg
# 6 55555 Tommy
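For reuse, the same idea can be wrapped in a small base R function. This is only a sketch under stated assumptions: the function name `keep_pairs`, its arguments, and the extra "Sandy" row in the test data are all hypothetical, and unlike the couples/singles answer it keeps every matching row in a group rather than only the first.

```r
# Hypothetical helper: per group, keep the rows matching `pattern`;
# if a group has no match, keep its first row only.
keep_pairs <- function(df, id_col, name_col, pattern = "&|\\band\\b") {
  hit <- grepl(pattern, df[[name_col]], ignore.case = TRUE)
  # TRUE for every row whose group contains at least one match
  group_has_hit <- as.logical(ave(hit, df[[id_col]], FUN = any))
  # TRUE for the first row of each group
  first_in_group <- !duplicated(df[[id_col]])
  df[hit | (!group_has_hit & first_in_group), , drop = FALSE]
}

# Test data: group 3 has two rows and no match at all
test_df <- data.frame(
  a = c(1, 1, 1, 2, 2, 3, 3),
  b = c("John", "Mary and Bob", "Ben", "Jim", "Larry & Meg", "Tommy", "Sandy"),
  stringsAsFactors = FALSE
)
res <- keep_pairs(test_df, "a", "b")
res
```

Because the selection is a single logical index into the original data frame, the original row order and row names are preserved.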