R - Removing rows from dataframe based on multiple conditions for a single column
I have the following example dataframe in R:
SampleID <- c("A", "A", "A", "A", "B", "B", "C", "C", "C", "C", "C", "C", "D", "D", "E", "E", "E", "E", "F", "F")
Analyte <- c("A1", "A1", "A2", "A2", "B1", "B2", "C1", "C1", "C1", "C2", "C2", "C2", "D1", "D2", "E1", "E1", "E2", "E2", "F1", "F2")
Fraction <- c("Dissolved", "Total", "Dissolved", "Total", "Total", "Total", "Dissolved", "Suspended", "Total", "Dissolved", "Suspended", "Total", "Unknown", "Unknown", "Dissolved", "Suspended", "Dissolved", "Suspended", "Dissolved", "Dissolved")
Concentration <- c(4.2, 5.6, 8.6, 11.2, 2.1, 9.6, 15.6, 28.7, 42.3, 18.3, 23.2, 48.6, 6.4, 28.8, 9.1, 32.5, 36.4, 24.5, 10.7, 3.4)
MyData <- data.frame(SampleID, Analyte, Fraction, Concentration)
MyData
   SampleID Analyte  Fraction Concentration
1         A      A1 Dissolved           4.2
2         A      A1     Total           5.6
3         A      A2 Dissolved           8.6
4         A      A2     Total          11.2
5         B      B1     Total           2.1
6         B      B2     Total           9.6
7         C      C1 Dissolved          15.6
8         C      C1 Suspended          28.7
9         C      C1     Total          42.3
10        C      C2 Dissolved          18.3
11        C      C2 Suspended          23.2
12        C      C2     Total          48.6
13        D      D1   Unknown           6.4
14        D      D2   Unknown          28.8
15        E      E1 Dissolved           9.1
16        E      E1 Suspended          32.5
17        E      E2 Dissolved          36.4
18        E      E2 Suspended          24.5
19        F      F1 Dissolved          10.7
20        F      F2 Dissolved           3.4
I would like to do the following:
For each SampleID, if an Analyte has a "Total" Fraction reported, retain only that row for the Analyte and remove rows with any other Fraction value (i.e., Dissolved, Suspended) for that Analyte.
If an Analyte for a SampleID includes both Dissolved and Suspended in the Fraction column (and no other values for Fraction), sum the concentrations for Dissolved and Suspended and add a row for that Analyte with the Fraction column labeled Total and the Concentration column listing the sum. Remove the original rows for Dissolved and Suspended for that Analyte.
So for the dataframe above, the two Analytes of SampleID "A" have Dissolved and Total, so I would want to remove the rows with the Dissolved Fraction. For SampleID "C", I would want to remove the Dissolved and Suspended Fractions of both Analytes and keep only the rows with Total. Lastly, for SampleID "E", the Dissolved and Suspended Fractions for each of the two Analytes would be summed, giving one new row per Analyte that holds the sum (relabeled as Total), and the rows for the Dissolved and Suspended Fractions would be removed.
The output for the above dataframe MyData would be the following:
   SampleID Analyte  Fraction Concentration
2         A      A1     Total           5.6
4         A      A2     Total          11.2
5         B      B1     Total           2.1
6         B      B2     Total           9.6
9         C      C1     Total          42.3
12        C      C2     Total          48.6
13        D      D1   Unknown           6.4
14        D      D2   Unknown          28.8
15        E      E1     Total          41.6
17        E      E2     Total          60.9
19        F      F1 Dissolved          10.7
20        F      F2 Dissolved           3.4
Note that the example I have provided is just a small subset of a much larger dataset that includes hundreds of SampleIDs, but the Fraction column can only equal the values listed in the original dataframe above (i.e., Dissolved, Suspended, Total, or Unknown).
Thank you!
This can be done by reshaping to wide format, filling in Total, and reshaping back to long format:
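library(tidyverse)
MyData %>%
  # One row per SampleID/Analyte, one column per Fraction
  pivot_wider(id_cols = c(SampleID, Analyte), names_from = Fraction,
              values_from = Concentration) %>%
  # Use the reported Total, else Dissolved + Suspended (NA if either is missing);
  # keep a lone Dissolved or Suspended value only when no Total exists
  mutate(Total = coalesce(Total, Dissolved + Suspended),
         Dissolved = ifelse(is.na(Total) & is.na(Suspended), Dissolved, NA),
         Suspended = ifelse(is.na(Total) & is.na(Dissolved), Suspended, NA)) %>%
  # Back to long format, dropping the emptied cells
  pivot_longer(-c(SampleID, Analyte), values_drop_na = TRUE)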
# A tibble: 12 x 4
   SampleID Analyte name      value
   <chr>    <chr>   <chr>     <dbl>
 1 A        A1      Total       5.6
 2 A        A2      Total      11.2
 3 B        B1      Total       2.1
 4 B        B2      Total       9.6
 5 C        C1      Total      42.3
 6 C        C2      Total      48.6
 7 D        D1      Unknown     6.4
 8 D        D2      Unknown    28.8
 9 E        E1      Total      41.6
10 E        E2      Total      60.9
11 F        F1      Dissolved  10.7
12 F        F2      Dissolved   3.4
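The result above labels the reshaped columns name and value, which are the pivot_longer() defaults. If the original column names are wanted, a minimal sketch is to pass names_to and values_to explicitly. This assumes tidyr 1.0.0 or later and reuses the example data from the question; the object name result is only for illustration.

```r
library(dplyr)
library(tidyr)

# Example data from the question
MyData <- data.frame(
  SampleID = c("A", "A", "A", "A", "B", "B", "C", "C", "C", "C", "C", "C",
               "D", "D", "E", "E", "E", "E", "F", "F"),
  Analyte = c("A1", "A1", "A2", "A2", "B1", "B2", "C1", "C1", "C1", "C2",
              "C2", "C2", "D1", "D2", "E1", "E1", "E2", "E2", "F1", "F2"),
  Fraction = c("Dissolved", "Total", "Dissolved", "Total", "Total", "Total",
               "Dissolved", "Suspended", "Total", "Dissolved", "Suspended",
               "Total", "Unknown", "Unknown", "Dissolved", "Suspended",
               "Dissolved", "Suspended", "Dissolved", "Dissolved"),
  Concentration = c(4.2, 5.6, 8.6, 11.2, 2.1, 9.6, 15.6, 28.7, 42.3, 18.3,
                    23.2, 48.6, 6.4, 28.8, 9.1, 32.5, 36.4, 24.5, 10.7, 3.4)
)

result <- MyData %>%
  pivot_wider(id_cols = c(SampleID, Analyte), names_from = Fraction,
              values_from = Concentration) %>%
  mutate(Total = coalesce(Total, Dissolved + Suspended),
         Dissolved = ifelse(is.na(Total) & is.na(Suspended), Dissolved, NA),
         Suspended = ifelse(is.na(Total) & is.na(Dissolved), Suspended, NA)) %>%
  # Name the long-format columns after the originals instead of name/value
  pivot_longer(-c(SampleID, Analyte),
               names_to = "Fraction", values_to = "Concentration",
               values_drop_na = TRUE)

result
```

The rows are the same as in the output above; only the last two column headers change to Fraction and Concentration.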
You can also use the following solution. It may look a bit verbose, but it also gets the job done:
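library(dplyr)
library(purrr)
MyData %>%
  # One data frame per SampleID/Analyte combination
  group_split(SampleID, Analyte) %>%
  # If a Total is reported, keep only that row
  map(~ if ("Total" %in% .x$Fraction) {
    .x %>% filter(Fraction == "Total")
  } else {
    .x
  }) %>%
  # If both Dissolved and Suspended remain, append their sum as a Total row
  map(~ if (all(c("Dissolved", "Suspended") %in% .x$Fraction)) {
    add_row(.x, SampleID = .x$SampleID[1], Analyte = .x$Analyte[1],
            Fraction = "Total", Concentration = sum(.x$Concentration))
  } else {
    .x
  }) %>%
  # Keep only the newly added Total rows and bind everything back together
  map_dfr(~ if ("Total" %in% .x$Fraction) {
    .x %>% filter(Fraction == "Total")
  } else {
    .x
  })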
# A tibble: 12 x 4
   SampleID Analyte Fraction  Concentration
   <chr>    <chr>   <chr>             <dbl>
 1 A        A1      Total               5.6
 2 A        A2      Total              11.2
 3 B        B1      Total               2.1
 4 B        B2      Total               9.6
 5 C        C1      Total              42.3
 6 C        C2      Total              48.6
 7 D        D1      Unknown             6.4
 8 D        D2      Unknown            28.8
 9 E        E1      Total              41.6
10 E        E2      Total              60.9
11 F        F1      Dissolved          10.7
12 F        F2      Dissolved           3.4
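The two rules from the question can also be applied in a single grouped step. The following is only a sketch and is not part of either answer above: it assumes dplyr 0.8.1 or later for group_modify(), and collapse_fractions is a hypothetical helper name. Unlike the purrr version, it sums only when a group contains exactly Dissolved and Suspended, which follows the "no other values for Fraction" wording in the question.

```r
library(dplyr)

# Example data from the question
MyData <- data.frame(
  SampleID = c("A", "A", "A", "A", "B", "B", "C", "C", "C", "C", "C", "C",
               "D", "D", "E", "E", "E", "E", "F", "F"),
  Analyte = c("A1", "A1", "A2", "A2", "B1", "B2", "C1", "C1", "C1", "C2",
              "C2", "C2", "D1", "D2", "E1", "E1", "E2", "E2", "F1", "F2"),
  Fraction = c("Dissolved", "Total", "Dissolved", "Total", "Total", "Total",
               "Dissolved", "Suspended", "Total", "Dissolved", "Suspended",
               "Total", "Unknown", "Unknown", "Dissolved", "Suspended",
               "Dissolved", "Suspended", "Dissolved", "Dissolved"),
  Concentration = c(4.2, 5.6, 8.6, 11.2, 2.1, 9.6, 15.6, 28.7, 42.3, 18.3,
                    23.2, 48.6, 6.4, 28.8, 9.1, 32.5, 36.4, 24.5, 10.7, 3.4)
)

# Hypothetical helper: reduce the rows of one SampleID/Analyte group.
# `rows` holds the non-grouping columns (Fraction, Concentration);
# `key` holds the group's SampleID and Analyte and is not used here.
collapse_fractions <- function(rows, key) {
  if ("Total" %in% rows$Fraction) {
    # Rule 1: a reported Total replaces every other fraction
    filter(rows, Fraction == "Total")
  } else if (setequal(rows$Fraction, c("Dissolved", "Suspended"))) {
    # Rule 2: exactly Dissolved and Suspended -> one summed Total row
    tibble(Fraction = "Total", Concentration = sum(rows$Concentration))
  } else {
    # Anything else (e.g. Unknown, or a lone Dissolved) is left untouched
    rows
  }
}

result <- MyData %>%
  group_by(SampleID, Analyte) %>%
  group_modify(collapse_fractions) %>%
  ungroup()

result
```

Keeping the rules in one named function makes it easier to adjust them later, for example if a group containing Unknown alongside Dissolved and Suspended should be handled differently on the full dataset.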