Remove rows whose values across columns contain more than 2 of 4 unique characters
Hopefully the wording of the title makes sense. I have a data frame whose cells hold one of the values "A", "B", "C", "D", "", or "A/B". Each row should contain no more than 2 of the letters "A", "B", "C", and "D", and I want to identify the rows that break this rule. How often each letter occurs within a row does not matter; I only want to know whether more than 2 of those 4 letters appear in the row.
Here is a sample data frame:
df.sample = as.data.frame(rbind(
  c("A","B","A","A/B","B","B","B","B","","B"),
  c("A","B","C","A","B","","","B","","B"),
  c("A","B","D","D","B","B","B","B","","B"),
  c("A","B","A","A","B","B","B","B","B","B")))
df.sample
  V1 V2 V3  V4 V5 V6 V7 V8 V9 V10
1  A  B  A A/B  B  B  B  B      B
2  A  B  C   A  B        B      B
3  A  B  D   D  B  B  B  B      B
4  A  B  A   A  B  B  B  B  B   B
I want to apply a function to each row that records, for each of the 4 letters ("A", "B", "C", or "D"), whether it occurs in the row: not its frequency, just a 0 or 1 value per letter. If the sum of those 4 values is > 2, I want to add the index of that row to a new vector, which will then be used to remove those rows from the data frame.
myfun = function(x){
  # Which rows contain > 2 different letters of A, B, C, or D?
  # The number of times each letter occurs in a given row does not matter.
  # What matters is whether a row contains more than 2 of the 4 letters.
  # Each row should contain at most 2 of them. The combination does not matter.
  out = which(something > 2) # pseudocode: 'something' is a placeholder
}
row.indexes = apply(df.sample,1,function(x) myfun(x)) #Return a vector of row indexes that contain more than 2 of the 4 letters.
new.df.sample = df.sample[-row.indexes,] #create new data frame excluding rows containing more than 2 of the 4 letters.
In the df.sample above, rows 2 and 3 contain more than 2 of those 4 letters and thus should be indexed for removal. After running df.sample through the function and removing the rows in row.indexes, my new.df.sample data frame should look like this:
  V1 V2 V3  V4 V5 V6 V7 V8 V9 V10
1  A  B  A A/B  B  B  B  B      B
4  A  B  A   A  B  B  B  B  B   B
I have tried to frame this as a logical test for each of the 4 letters: assign a 0 or 1 to each letter, sum them, and then identify the rows whose sum is > 2. For instance, I thought I could use 'grep()' for each letter, convert the result to a logical, then to a 0 or 1, and sum. That seems too lengthy, and it did not work the way I tried it. Any ideas?
Here's a function for this task. It returns a logical value: TRUE indicates a row with more than two different strings among "A", "B", "C", and "D":
myfun <- function(x) {
  # split combined cells such as "A/B" into their individual letters
  sp <- unlist(strsplit(x, "/"))
  # count the distinct letters of interest; "" is ignored by the %in% filter
  length(unique(sp[sp %in% c("A", "B", "C", "D")])) > 2
}
row.indexes <- apply(df.sample, 1, myfun)
# [1] FALSE  TRUE  TRUE FALSE
new.df.sample <- df.sample[!row.indexes, ] # negate the index with '!'
#   V1 V2 V3  V4 V5 V6 V7 V8 V9 V10
# 1  A  B  A A/B  B  B  B  B      B
# 4  A  B  A   A  B  B  B  B  B   B
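The 0/1-per-letter idea described in the question can also be made to work. Here is a minimal sketch, assuming every cell contains only the letters A to D, "/", or nothing (a cell such as "NA" or "AB" would be matched as a substring and miscounted). The names letters.of.interest, presence and n.letters are made up for illustration:

```r
# Sample data from the question; stringsAsFactors = FALSE keeps plain strings.
df.sample <- as.data.frame(rbind(
  c("A", "B", "A", "A/B", "B", "B", "B", "B", "",  "B"),
  c("A", "B", "C", "A",   "B", "",  "",  "B", "",  "B"),
  c("A", "B", "D", "D",   "B", "B", "B", "B", "",  "B"),
  c("A", "B", "A", "A",   "B", "B", "B", "B", "B", "B")
), stringsAsFactors = FALSE)

letters.of.interest <- c("A", "B", "C", "D")  # hypothetical helper name

# One column per letter: TRUE if the letter appears anywhere in the row.
# grepl(fixed = TRUE) also finds letters inside combined cells such as "A/B".
m <- as.matrix(df.sample)
presence <- sapply(letters.of.interest, function(l) {
  apply(m, 1, function(row) any(grepl(l, row, fixed = TRUE)))
})

# Number of distinct letters per row (TRUE counts as 1): 2 3 3 2
n.letters <- rowSums(presence)

# Keep only the rows with at most 2 of the 4 letters (rows 1 and 4 here).
new.df.sample <- df.sample[n.letters <= 2, ]
```

Compared with the strsplit() version, this keeps the intermediate presence matrix around, which may be handy if you later want to know which letters a row contains, not just how many.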