How to create a new dataframe of consolidated values from multiple columns in R
I have a dataframe, df1, that looks like the following:
sample | 99_Ape_1 | 93_Cat_1 | 87_Ape_2 | 84_Cat_2 | 90_Dog_1 | 92_Dog_2 |
---|---|---|---|---|---|---|
A | 2 | 3 | 1 | 7 | 4 | 6 |
B | 5 | 9 | 7 | 0 | 3 | 7 |
C | 6 | 8 | 9 | 2 | 3 | 0 |
D | 3 | 9 | 0 | 5 | 8 | 3 |
I want to consolidate the dataframe by summing the values based on the animal present in the header row, i.e. by "Ape", "Cat", "Dog", and end up with the following dataframe:
sample | Ape | Cat | Dog |
---|---|---|---|
A | 3 | 10 | 10 |
B | 12 | 9 | 10 |
C | 15 | 10 | 3 |
D | 3 | 14 | 11 |
I have created a list, "animals_list", that holds the names of all the animals.
I have then created a list of dataframes, one per animal, by subsetting the matching columns:
species_extract <- list()
for (i in 1:length(animals_list)){
species_extract[[i]] <- df1[, grep(animals_list[i], names(df1))]
}
I am then trying to sum each variable in the row by sample:
for (i in 1:length(species_extract)){
species_extract[[i]]$total <- rowSums(species_extract[[i]])
}
and then create a dataframe 'animal_total' by binding all values in the new 'total' column.
animal_total <- NULL
for (i in 1:length(species_extract)){
animal_total[i] <- cbind(species_extract[[i]]$total)
}
Unfortunately, this doesn't seem to work at all and I think I may have taken the wrong route. Any help would be really appreciated!
EDIT: My dataframe has over 300 animals, so a solution that uses my list of identifiers (animals_list) would be highly appreciated! I would also note that some column names do not follow the structure "number_animal_number", and therefore I can't use a repetitive search (sorry!).
A data.table approach:
library(data.table)
library(rlist)
#set data to data.table format
setDT(df1)
# split column 2:n by regex on column names
L <- split.default(df1[,-1], gsub(".*_(.*)_.*", "\\1", names(df1)[-1]))
# Bind together again
data.table(sample = df1$sample,
as.data.table(list.cbind(lapply(L, rowSums))))
# sample Ape Cat Dog
# 1: A 3 10 10
# 2: B 12 9 10
# 3: C 15 10 3
# 4: D 3 14 11
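To make the grouping step above easier to follow, here is a minimal base R sketch of the same split-and-sum idea with no extra packages. It assumes `df1` has the layout shown in the question (rebuilt by hand here) and that every value column follows the `number_animal_number` pattern; `key`, `totals` and `result` are illustrative names, not part of the original answer.

```r
# Rebuild the example data; check.names = FALSE keeps names such as "99_Ape_1"
df1 <- data.frame(
  sample     = c("A", "B", "C", "D"),
  `99_Ape_1` = c(2, 5, 6, 3),
  `93_Cat_1` = c(3, 9, 8, 9),
  `87_Ape_2` = c(1, 7, 9, 0),
  `84_Cat_2` = c(7, 0, 2, 5),
  `90_Dog_1` = c(4, 3, 3, 8),
  `92_Dog_2` = c(6, 7, 0, 3),
  check.names = FALSE
)

# Grouping key: the text between the underscores of each column name
key <- gsub(".*_(.*)_.*", "\\1", names(df1)[-1])
key
# "Ape" "Cat" "Ape" "Cat" "Dog" "Dog"

# Split the value columns by key, then sum each group row-wise
totals <- sapply(split.default(df1[-1], key), rowSums)
result <- data.frame(sample = df1$sample, totals)
result
```

Column names that do not match the pattern pass through `gsub()` unchanged and end up as their own group, so this only fits data where the pattern holds.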
Update (after clarification): This may or may not work, depending on what the other column names of your animals look like, but it is a start:
library(dplyr)
library(tidyr)
library(stringr)  # provides str_extract()
df1 %>%
pivot_longer(
cols = -sample
) %>%
mutate(name1 = str_extract(name, '(?<=\\_)(.*?)(?=\\_)')) %>%
group_by(sample, name1) %>%
summarise(sum=sum(value)) %>%
pivot_wider(
names_from = name1,
values_from= sum
)
Output:
sample Ape Cat Dog
<chr> <int> <int> <int>
1 A 3 10 10
2 B 12 9 10
3 C 15 10 3
4 D 3 14 11
First answer: Here is how we could do it with dplyr:
library(dplyr)
df1 %>%
mutate(Cat = rowSums(select(., contains("Cat"))),
Ape = rowSums(select(., contains("Ape"))),
Dog = rowSums(select(., contains("Dog")))) %>%
select(sample, Ape, Cat, Dog)
sample Ape Cat Dog
<chr> <int> <int> <int>
1 A 3 10 10
2 B 12 9 10
3 C 15 10 3
4 D 3 14 11
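Writing one `rowSums()` line per animal does not scale to 300 animals. The same idea can be driven by a vector of identifiers, as in the minimal base R sketch below. It assumes `animals_list` is a character vector of animal names, as the question describes, and that no animal name is a substring of another (for example "Cat" and "Caterpillar"), because `grep()` is used here for plain substring matching.

```r
# Rebuild the example data; check.names = FALSE keeps names such as "99_Ape_1"
df1 <- data.frame(
  sample     = c("A", "B", "C", "D"),
  `99_Ape_1` = c(2, 5, 6, 3),
  `93_Cat_1` = c(3, 9, 8, 9),
  `87_Ape_2` = c(1, 7, 9, 0),
  `84_Cat_2` = c(7, 0, 2, 5),
  `90_Dog_1` = c(4, 3, 3, 8),
  `92_Dog_2` = c(6, 7, 0, 3),
  check.names = FALSE
)

# Assumed: a character vector holding every animal identifier
animals_list <- c("Ape", "Cat", "Dog")

# For each animal, sum the columns whose name contains that identifier.
# fixed = TRUE treats the identifier as literal text; drop = FALSE keeps a
# data frame even when only one column matches, so rowSums() still works.
totals <- sapply(animals_list, function(animal) {
  cols <- grep(animal, names(df1), fixed = TRUE)
  rowSums(df1[, cols, drop = FALSE])
})

animal_total <- data.frame(sample = df1$sample, totals)
animal_total
```

Because the match is on the identifier itself rather than on its position in the name, columns that do not follow `number_animal_number` are still picked up as long as they contain the animal name.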
An alternative data.table solution:
library(data.table)
# Construct data table
dt <- as.data.table(list(sample = c("A", "B", "C", "D"),
`99_Ape_1` = c(2, 5, 6, 3),
`93_Cat_1` = c(3, 9, 8, 9),
`87_Ape_2` = c(1, 7, 9, 0),
`84_Cat_2` = c(7, 0, 2, 5),
`90_Dog_1` = c(4, 3, 3, 8),
`92_Dog_2` = c(6, 7, 0, 3)))
# Alternatively convert existing dataframe
# dt <- setDT(df)
# Use Regex pattern to drop ids from column names
names(dt) <- gsub("((^[0-9_]{3})|(_[0-9]{1}$))", "", names(dt))
# Pivot long (columns to rows)
dt <- melt(dt, id.vars = "sample")
# Aggregate sample by variable
dt <- dt[, .(value=sum(value)), by=.(sample, variable)]
# Unpivot (rows to columns)
dcast(dt, sample ~ variable)
# sample Ape Cat Dog
# 1: A 3 10 10
# 2: B 12 9 10
# 3: C 15 10 3
# 4: D 3 14 11
Alternatively, leaving the column names as they are (following the OP's comment on the previous answer) and assuming that there are multiple observations of the same samples:
dt <- as.data.table(list(sample = c("A", "B", "C", "D", "A"),
`99_Ape_1` = c(2, 5, 6, 3, 1),
`93_Cat_1` = c(3, 9, 8, 9, 1),
`87_Ape_2` = c(1, 7, 9, 0, 1),
`84_Cat_2` = c(7, 0, 2, 5, 1),
`90_Dog_1` = c(4, 3, 3, 8, 1),
`92_Dog_2` = c(6, 7, 0, 3, 1)))
dt
# sample 99_Ape_1 93_Cat_1 87_Ape_2 84_Cat_2 90_Dog_1 92_Dog_2
# 1: A 2 3 1 7 4 6
# 2: B 5 9 7 0 3 7
# 3: C 6 8 9 2 3 0
# 4: D 3 9 0 5 8 3
# 5: A 1 1 1 1 1 1
# Pivot long (columns to rows)
dt <- melt(dt, id.vars = "sample")
# Aggregate sample by variable
dt <- dt[, .(value=sum(value)), by=.(sample, variable)]
# Unpivot (rows to columns)
dcast(dt, sample ~ variable)
# sample 99_Ape_1 93_Cat_1 87_Ape_2 84_Cat_2 90_Dog_1 92_Dog_2
# 1: A 3 4 2 8 5 7
# 2: B 5 9 7 0 3 7
# 3: C 6 8 9 2 3 0
# 4: D 3 9 0 5 8 3
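The variant above sums repeated samples but still leaves one column per original name. If both consolidations are wanted at once, the long format makes that straightforward: derive the animal from the column name before aggregating. Below is a minimal base R sketch, assuming the column names follow `number_animal_number`; `dat`, `long` and `totals` are illustrative names, and `stack()` and `tapply()` stand in for `melt()` and `dcast()`.

```r
# Example data with sample "A" observed twice
dat <- data.frame(
  sample     = c("A", "B", "C", "D", "A"),
  `99_Ape_1` = c(2, 5, 6, 3, 1),
  `93_Cat_1` = c(3, 9, 8, 9, 1),
  `87_Ape_2` = c(1, 7, 9, 0, 1),
  `84_Cat_2` = c(7, 0, 2, 5, 1),
  `90_Dog_1` = c(4, 3, 3, 8, 1),
  `92_Dog_2` = c(6, 7, 0, 3, 1),
  check.names = FALSE
)

# Long format: one row per (sample, original column) pair.
# stack() returns the columns `values` and `ind` (the source column name).
long <- stack(dat[-1])
long$sample <- rep(dat$sample, times = ncol(dat) - 1)

# Derive the animal from the original column name
long$animal <- gsub(".*_(.*)_.*", "\\1", as.character(long$ind))

# Sum over every row that shares a sample and an animal
totals <- tapply(long$values,
                 list(sample = long$sample, animal = long$animal),
                 sum)
totals
```

The result is a matrix with one row per sample and one column per animal; for sample A, which appears twice, the Ape total is 2 + 1 + 1 + 1 = 5.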