
Remove rows whose values across columns contain more than 2 of 4 unique characters

Hopefully the wording of the title makes sense. I have a data frame whose cells take the values "A", "B", "C", "D", "", or "A/B". Each row should contain at most 2 of the letters "A", "B", "C", and "D", so I want to identify the rows that contain more than 2 of them. The frequency of each letter within a row does not matter; I just want to know whether more than 2 of those 4 letters exist in the row.

Here is a sample data frame:

    df.sample = as.data.frame(rbind(c("A","B","A","A/B","B","B","B","B","","B"),c("A","B","C","A","B","","","B","","B"),c("A","B","D","D","B","B","B","B","","B"),c("A","B","A","A","B","B","B","B","B","B")))
    df.sample

      V1 V2 V3  V4 V5 V6 V7 V8 V9 V10
    1  A  B  A A/B  B  B  B  B      B
    2  A  B  C   A  B        B      B
    3  A  B  D   D  B  B  B  B      B
    4  A  B  A   A  B  B  B  B  B   B

I want to apply a function to each row that records, for each of the 4 letters ("A", "B", "C", or "D"), whether it occurs in the row: not its frequency, just a 0 or 1 value per letter. If the sum of those 4 values is > 2, I want to add the index of that row to a new vector, which will then be used to remove those rows from the data frame.

    myfun <- function(x) {
      # Which rows contain > 2 different letters of A, B, C, or D?
      # The number of times each letter occurs in a given row does not matter.
      # What matters is whether the row contains more than 2 of the 4 letters.
      # Each row should contain at most 2 of them; the combination does not matter.

      out = which(something > 2)  # placeholder: 'something' is not defined yet
    }

    row.indexes = apply(df.sample,1,function(x) myfun(x)) #Return a vector of row indexes that contain more than 2 of the 4 letters.

    new.df.sample = df.sample[-row.indexes,] #create new data frame excluding rows containing more than 2 of the 4 letters.

In the df.sample above, rows 2 and 3 contain more than 2 of those 4 letters and thus should be indexed for removal. After running df.sample through the function and removing the rows in row.indexes, my new.df.sample data frame should look like this:

      V1 V2 V3  V4 V5 V6 V7 V8 V9 V10
    1  A  B  A A/B  B  B  B  B      B
    4  A  B  A   A  B  B  B  B  B   B

I have tried to think of this as a logical test for each of the 4 letters that assigns a 0 or 1 to each letter, sums them, and then identifies which rows sum to > 2. For instance, I thought I could use 'grep()' for each letter, convert the result to a logical, then to a 0 or 1, and sum. That seems too lengthy, and it didn't work the way I tried it. Any ideas?

Here's a function for this task. It splits combined values such as "A/B" on "/" and then counts the distinct letters in the row. The function returns a logical value; TRUE indicates a row with more than two different letters:

    myfun <- function(x) {
      sp <- unlist(strsplit(x, "/"))
      length(unique(sp[sp %in% c("A", "B", "C", "D")])) > 2
    }

    row.indexes <- apply(df.sample, 1, myfun)
    # [1] FALSE  TRUE  TRUE FALSE

    new.df.sample <- df.sample[!row.indexes, ] # negate the index with '!'

    #   V1 V2 V3  V4 V5 V6 V7 V8 V9 V10
    # 1  A  B  A A/B  B  B  B  B      B
    # 4  A  B  A   A  B  B  B  B  B   B
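For comparison, the 0/1-per-letter idea from the question can also be made to work. The following is only a sketch of one possible way to do it, not part of the answer above; the names `letters4`, `m`, `present`, `n.letters`, and `drop.rows` are made up for illustration, and `stringsAsFactors = FALSE` is assumed so that the cells are plain strings.

```r
# Sample data from the question (stringsAsFactors = FALSE keeps plain strings).
df.sample <- as.data.frame(rbind(
  c("A","B","A","A/B","B","B","B","B","","B"),
  c("A","B","C","A","B","","","B","","B"),
  c("A","B","D","D","B","B","B","B","","B"),
  c("A","B","A","A","B","B","B","B","B","B")
), stringsAsFactors = FALSE)

letters4 <- c("A", "B", "C", "D")  # illustrative name for the letter set

# One column per letter: TRUE if that letter appears anywhere in the row.
# grepl() with fixed = TRUE also finds "A" and "B" inside "A/B".
m <- as.matrix(df.sample)
present <- sapply(letters4, function(l) {
  rowSums(matrix(grepl(l, m, fixed = TRUE), nrow = nrow(m))) > 0
})

n.letters <- rowSums(present)      # number of distinct letters per row
drop.rows <- which(n.letters > 2)  # integer row indexes, as the question asked for

# Guard the empty case: df[-integer(0), ] would select zero rows.
new.df.sample <- if (length(drop.rows)) df.sample[-drop.rows, ] else df.sample
```

With the sample data, `drop.rows` holds rows 2 and 3 and `new.df.sample` keeps rows 1 and 4. The guard on `length(drop.rows)` matters because negative indexing with an empty vector returns no rows at all; the logical index with `!` used in the answer avoids that problem.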
