How to keep one row and remove the others if a condition is met

Question

I am working with taxonomic data and have gotten it to the second-to-last step before I can display it graphically. However, I need the rows to match certain conditions, and this is where I am stuck, because I do not want to do it manually.
My data:

x <- data.frame("Phylum" = c("Chordata", "Chordata", "Chordata", "Chordata", "Chordata", "Chordata"),
                "Class" = c("NA", "Actinopterygii", "Actinopterygii", "Actinopterygii", "Actinopterygii", "Actinopterygii"),
                "Order" = c("NA", "NA", "Gadiformes", "Gadiformes", "Gadiformes", "Gadiformes"), 
                "Family" = c("NA", "NA", "NA", "Moridae", "Moridae", "Moridae"), 
                "Genus" = c("NA", "NA", "NA", "NA", "Notophycis", "Notophycis"), 
                "Species" = c("NA", "NA", "NA", "NA", "NA", "Notophycis marginata"),
                 Number = c(21616, 12123, 1497, 730,730,730))

The desired end result:

y <- data.frame("Phylum" = c("Chordata", "Chordata", "Chordata", "Chordata"), 
                "Class" = c("NA", "Actinopterygii", "Actinopterygii", "Actinopterygii"), 
                "Order" = c("NA", "NA", "Gadiformes", "Gadiformes"), "Family" = c("NA", "NA", "NA", "Moridae"), 
                "Genus" = c("NA", "NA", "NA", "Notophycis"), "Species" = c("NA", "NA", "NA", "Notophycis marginata"), 
                 Number = c(9493, 10626, 767, 730))

This is a simple subset of a much larger, more complicated dataset. If I could somehow put this into code:

sum of Number (Phylum == "P1" & Class == "NA") - sum of Number (Class == "C1" & Order == "NA"), IF the phylum matches; this would be P1's new Number
sum of Number (Class == "C1" & Order == "NA") - sum of Number (Order == "O1" & Family == "NA"), IF the class matches; this would be C1's new Number, etc.

BUT if the Number matches across multiple rows, I need code that evaluates those rows, picks the row with the fewest NAs, and keeps that Number...

I presume I need to write a function to do this, but I have no idea where to even start!

Appreciate the help :)

UPDATE

Tester:

Phylum  Class   Order   Family  Genus   Species Reads_sum
Chordata    Elasmobranchii  Carcharhiniformes   NA  NA  NA  31
Chordata    Actinopterygii  Perciformes Scombridae  NA  NA  589
Chordata    Elasmobranchii  Carcharhiniformes   Pentanchidae    NA  NA  31
Chordata    Actinopterygii  Myctophiformes  Myctophidae Notoscopelus    NA  208
Chordata    Actinopterygii  Perciformes Scombridae  Katsuwonus  NA  589
Chordata    Actinopterygii  Myctophiformes  Myctophidae Notoscopelus    Notoscopelus caudispinosus  178
Chordata    Actinopterygii  Perciformes Scombridae  Katsuwonus  Katsuwonus pelamis  589
Cnidaria    Hydrozoa    Leptothecata    Plumulariidae   NA  NA  69
Cnidaria    Hydrozoa    Leptothecata    Plumulariidae   Plumularia  NA  69
Echinodermata   Ophiuroidea NA  NA  NA  NA  146
Echinodermata   Ophiuroidea Ophiurida   NA  NA  NA  137
Echinodermata   Ophiuroidea Ophiurida   Ophiuridae  NA  NA  137
Echinodermata   Ophiuroidea Ophiurida   Ophiuridae  Ophioplinthus   NA  137
Echinodermata   Ophiuroidea Ophiurida   Ophiuridae  Ophioplinthus   Ophioplinthus accomodata    137
Mollusca    Cephalopoda Oegopsida   Ommastrephidae  NA  NA  34311
Ochrophyta  Phaeophyceae    Ectocarpales    Acinetosporaceae    NA  NA  29

Code that performs what I would like, BUT I would have to change the variables each time:

Tester$Reads_sum[Tester$Class == "Ophiuroidea" & Tester$Order == "NA"] - sum(Tester$Reads_sum[Tester$Class == "Ophiuroidea" & Tester$Order != "NA" & Tester$Family == "NA"])

So I was hoping something like this would work, and that I would only need to swap Class for the other taxonomic ranks:

for (i in unique(Tester$Class)){
  Tester$Test.1 <- ifelse(Tester$Class != "NA" & Tester$Order == "NA", 
                           Tester$Reads_sum[Tester$Class == i & Tester$Order == "NA"] - sum(Tester$Reads_sum[Tester$Class == i & Tester$Order != "NA" & Tester$Family == "NA"]), 0)
  }

But it is giving me an NA instead of 9.

The end data should look like this:
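One possible reason for the NA, shown as a minimal sketch rather than a confirmed diagnosis: each pass of the loop overwrites the whole Test.1 column, so only the last class counts, and for a class that has no row with Order == "NA" the subset is empty. ifelse() then recycles that zero-length value to NA. The three toy vectors below are made up for illustration and assume missing ranks are stored as the string "NA".

```r
# Toy vectors standing in for Tester$Reads_sum, Tester$Class, Tester$Order.
reads <- c(146, 137, 29)
class <- c("Ophiuroidea", "Ophiuroidea", "Phaeophyceae")
order <- c("NA", "Ophiurida", "Ectocarpales")

# For a class with no Order == "NA" row, the subset has length zero ...
empty <- reads[class == "Phaeophyceae" & order == "NA"]
n_empty <- length(empty)

# ... and ifelse() fills the TRUE positions with NA when `yes` is empty.
res <- ifelse(order == "NA", empty - 0, 0)
print(res)
```

Because the loop assigns the full column on every iteration, whichever class comes last in unique(Tester$Class) decides the final values.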

Phylum  Class   Order   Family  Genus   Species Reads_sum
Chordata    Elasmobranchii  Carcharhiniformes   Pentanchidae    NA  NA  31
Chordata    Actinopterygii  Myctophiformes  Myctophidae Notoscopelus    NA  30
Chordata    Actinopterygii  Myctophiformes  Myctophidae Notoscopelus    Notoscopelus caudispinosus  178
Chordata    Actinopterygii  Perciformes Scombridae  Katsuwonus  Katsuwonus pelamis  589
Cnidaria    Hydrozoa    Leptothecata    Plumulariidae   Plumularia  NA  69
Echinodermata   Ophiuroidea NA  NA  NA  NA  9
Echinodermata   Ophiuroidea Ophiurida   Ophiuridae  Ophioplinthus   Ophioplinthus accomodata    137
Mollusca    Cephalopoda Oegopsida   Ommastrephidae  NA  NA  34311
Ochrophyta  Phaeophyceae    Ectocarpales    Acinetosporaceae    NA  NA  29

Answer 1

Thanks for the update. I've come up with something that I think does what you are looking for, but it will need some supporting explanation.

Am I right in thinking it is tree-like data in the order c("Phylum", "Class", "Order", "Family", "Genus", "Species")? And that, for each level of the tree, you want to subtract the values of the level below?
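If that reading is right, the rule can be checked against the small x example from the question. The sketch below uses base R rather than data.table. subtract_children is a hypothetical helper, and it assumes that every row has at least a Phylum and that missing ranks (the string "NA" or a real NA) only ever trail the filled ones. Dropping rows whose remainder is zero is also an assumption about how the rows with a duplicated Number should be handled.

```r
ranks <- c("Phylum", "Class", "Order", "Family", "Genus", "Species")

# The example data from the question.
x <- data.frame(
  Phylum  = rep("Chordata", 6),
  Class   = c("NA", rep("Actinopterygii", 5)),
  Order   = c("NA", "NA", rep("Gadiformes", 4)),
  Family  = c("NA", "NA", "NA", rep("Moridae", 3)),
  Genus   = c("NA", "NA", "NA", "NA", "Notophycis", "Notophycis"),
  Species = c("NA", "NA", "NA", "NA", "NA", "Notophycis marginata"),
  Number  = c(21616, 12123, 1497, 730, 730, 730),
  stringsAsFactors = FALSE
)

# Hypothetical helper: subtract the counts of each row's direct children
# (one rank deeper, same taxonomy so far), then drop rows left with zero.
subtract_children <- function(df, count_col, ranks) {
  tax <- as.matrix(df[, ranks])
  tax[!is.na(tax) & tax == "NA"] <- NA      # treat the string "NA" as missing
  depth <- rowSums(!is.na(tax))             # number of ranks filled in per row
  counts <- as.numeric(df[[count_col]])
  own <- vapply(seq_len(nrow(df)), function(i) {
    d <- depth[i]
    if (d == length(ranks)) return(counts[i])   # deepest rank: nothing below
    prefix <- tax[i, seq_len(d)]
    same_prefix <- apply(tax[, seq_len(d), drop = FALSE], 1,
                         function(r) isTRUE(all(r == prefix)))
    is_child <- depth == d + 1 & same_prefix
    counts[i] - sum(counts[is_child])
  }, numeric(1))
  df[[count_col]] <- own
  df[own != 0, , drop = FALSE]
}

result <- subtract_children(x, "Number", ranks)
print(result)
```

On this example the remaining Number values line up with the y data frame in the question: the Family and Genus rows end up at zero and are dropped, leaving the Species row to carry the 730.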

I hope my code isn't too confusing; I found the data challenging to use in its current format. I prefer to split it into the levels of the tree, i.e. from the rows that only have Phylum data all the way down to the rows that have every level of the tree. To do so, I am most comfortable using the data.table package.
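Working out which level each row sits at does not strictly need a reshape. As a rough base R sketch (the three rows are an excerpt of Tester, the depth column name is made up here, and the string "NA" is assumed to mark a missing rank):

```r
ranks <- c("Phylum", "Class", "Order", "Family", "Genus", "Species")

# Three rows taken from the Tester data in the question.
tester <- data.frame(
  Phylum    = c("Echinodermata", "Echinodermata", "Mollusca"),
  Class     = c("Ophiuroidea", "Ophiuroidea", "Cephalopoda"),
  Order     = c("NA", "Ophiurida", "Oegopsida"),
  Family    = c("NA", "NA", "Ommastrephidae"),
  Genus     = "NA",
  Species   = "NA",
  Reads_sum = c(146, 137, 34311),
  stringsAsFactors = FALSE
)

# Turn the "NA" strings into real NAs, then count filled ranks per row.
tax <- as.matrix(tester[, ranks])
tax[!is.na(tax) & tax == "NA"] <- NA
tester$depth <- unname(rowSums(!is.na(tax)))

# One data frame per tree level, named by the number of filled ranks.
by_depth <- split(tester, tester$depth)
print(names(by_depth))
```

The data.table code that follows reaches the same grouping through melt and .N, which scales better on a large table.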

I've used lapply calls where I can, as I find them easy to interpret once you have used them a lot. I'm sure there is a more efficient solution out there, but as a starting point I think knowing and understanding the required steps is more important.

# using data.table package, as I find it quicker and easier to work with 
# for complex problems. Run the commented-out command below if you don't have it
# install.packages("data.table")
library(data.table)

# turning it into a data.table, similar to a data.frame, but with some differences.
dt <- as.data.table(Tester)
# I am making an id, which I will use to split up this data. Different rows 
# have different structures, as it's a tree structure, so I am going to break
# the data up
dt[, id := 1:.N]

# to do so I need to know the order of significance of the tree. I believe
# they go in this order:
col_structure <- c("Phylum", "Class", "Order", "Family", "Genus", "Species")

# I want to find out at which level of the tree each row is, so I am going
# to change the shape from wide to long, and then do some row aggregation on 
# the single column, to group
melt_dt <- melt(dt, id.vars = "id", 
                measure.vars = col_structure)
# tip: try not to use "NA", but instead NA, they have different structures 
# and built in commands like is.na make them easier to differentiate
melt_dt[value == "NA", value := NA]
melt_dt <- melt_dt[!is.na(value)]
melt_dt[]
# using a data.table command .N, grouped by id, to find out how many non NA
# values there are, this will tell me where it is in the tree
group_ids <- melt_dt[, .N, by = id]

# Ok, so now I will split up each row in to where it sits in the tree
split_ids <- split(group_ids, group_ids$N)
split_ids
# pull out the number of levels of tree for easy use
levels <- seq_along(split_ids)

# merge back in the original data, so we have the same data at the start, but
# split up in to new sets. Makes it easier to think about the problem
split_dt <- lapply(levels, function(x){
  out <- merge(split_ids[[x]], dt, by = "id")
  N <- as.numeric(names(split_ids)[x])
  # using keys in my data, to make easy extraction. means rather than do
  # Phylum == "a" & Class == "b" later on, if Phylum & Class are the keys,
  # then can use command J("a", "b"). See next stage
  setkeyv(out, col_structure[1:N])
  out
})

# Now I'm going to add the value in. For each level I look at the next level
# of the tree and subtract that level's values from Reads_sum. Try it with
# x = 1.
# Note: this assumes consecutive list elements are exactly one level apart.
# I've left out the bottom level of the tree here, as I don't know what to do
# with it
split_dt_with_value <- lapply(levels[1:(length(levels)-1)], function(x){
  # similar to for loop, but using data.table keys to extract data
  out <- split_dt[[x]]
  out$Test.1 <- out$Reads_sum - sapply(1:nrow(out), function(i){
    sum(split_dt[[(x+1)]][J(out[i, key(out), with = FALSE])]$Reads_sum,
        na.rm = TRUE)
  })
  out
})

# combine results, and with the bottom tree level
combined <- rbindlist(c(split_dt_with_value,
                        split_dt[max(levels)]), 
                        fill = TRUE)
# turn it back in to data frame form 
combined <- as.data.frame(combined)
combined
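The keyed lookup used inside the sapply call can be seen in isolation in the small sketch below. The toy table is a three-row excerpt based on Tester and the name "Missing" is made up for illustration; setkeyv and J are data.table's own.

```r
library(data.table)

# Toy table: three Class-level totals.
toy <- data.table(
  Phylum    = c("Chordata", "Chordata", "Cnidaria"),
  Class     = c("Actinopterygii", "Elasmobranchii", "Hydrozoa"),
  Reads_sum = c(589, 31, 69)
)

# Key the table on the ranks we want to look rows up by.
setkeyv(toy, c("Phylum", "Class"))

# J() matches its arguments against the key columns in order, so this is
# the keyed form of toy[Phylum == "Chordata" & Class == "Elasmobranchii"].
hit <- toy[J("Chordata", "Elasmobranchii")]$Reads_sum

# A key with no match returns a row of NAs (the default nomatch = NA),
# which is why the sum in the main code uses na.rm = TRUE.
miss <- sum(toy[J("Chordata", "Missing")]$Reads_sum, na.rm = TRUE)

print(c(hit = hit, miss = miss))
```

This is why a parent row with no children one level down keeps its full Reads_sum: the subtraction term comes out as zero.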

Please have a look and let me know if any steps are confusing or if any of the logic is incorrect :)

Cheers, Jonny

Answer 1
0 Accepted 2018-09-15 10:34:11