How to combine multiple columns with grep and sum the values in R

I have the following data frame in R:

Engine   General   Ladder.winch   engine.phe   subm.gear.box   aux.engine   pipeline.maintain    pipeline    pipe.line    engine.mpd
 1        12        22             2            4               2             4                    5            6             7

and so on, with more than 10,000 rows.

Now I want to combine columns and add their values, reducing the columns to broader categories. For example, Engine, engine.phe, aux.engine and engine.mpd should be combined into an Engine category, with all their values added together. Likewise, pipeline.maintain, pipeline and pipe.line should be combined into Pipeline, and the remaining columns should be added under a General category.

The desired data frame would be:

 Engine      Pipeline       General
   12          15             38

How can I do this in R?

There are many ways to do this; here is a straightforward approach:

# Example data.frame
dtf <- structure(list(Engine = c(1, 0, 1), 
   General = c(12, 3, 15), Ladder.winch = c(22, 28, 26), 
    engine.phe = c(2, 1, 0), subm.gear.box = c(4, 4, 10), 
    aux.engine = c(2, 3, 1), pipeline.maintain = c(4, 5, 1), 
    pipeline = c(5, 5, 2), pipe.line = c(6, 8, 2), engine.mpd = c(7, 8, 19)),
    .Names = c("Engine", "General", "Ladder.winch", "engine.phe", 
      "subm.gear.box", "aux.engine", "pipeline.maintain", 
      "pipeline", "pipe.line", "engine.mpd"), 
    row.names = c(NA, -3L), class = "data.frame")

with(dtf, data.frame(Engine=Engine+engine.phe+aux.engine+engine.mpd,
                   Pipeline=pipeline.maintain+pipeline+pipe.line,
                    General=General+Ladder.winch+subm.gear.box))

#   Engine Pipeline General
# 1     12       15      38
# 2     12       18      35
# 3     21        5      51

# a more generalized and 'greppy' solution
cnames <- tolower(colnames(dtf))
data.frame(Engine=rowSums(dtf[, grep("eng", cnames)]),
         Pipeline=rowSums(dtf[, grep("pip", cnames)]),
          General=rowSums(dtf[, !grepl("eng|pip", cnames)]))
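If more categories are needed, the grep-based version above can be driven by a lookup of patterns. The following is only a sketch: the `patterns` vector and its regular expressions are assumptions chosen for this example data, and `drop = FALSE` is added so that a category matching just one column still yields a data frame that `rowSums` accepts.

```r
# Example data (first two rows of the data frame above)
dtf <- data.frame(
  Engine = c(1, 0), General = c(12, 3), Ladder.winch = c(22, 28),
  engine.phe = c(2, 1), subm.gear.box = c(4, 4), aux.engine = c(2, 3),
  pipeline.maintain = c(4, 5), pipeline = c(5, 5), pipe.line = c(6, 8),
  engine.mpd = c(7, 8)
)

# Hypothetical lookup: category name -> regular expression (lower case)
patterns <- c(Engine = "eng", Pipeline = "pip")

cnames <- tolower(colnames(dtf))
# Logical matrix: one row per column of dtf, one column per category
matched <- sapply(patterns, grepl, x = cnames)

# Guard against a column name that matches more than one category
stopifnot(all(rowSums(matched) <= 1))

# Row-wise sums per category; drop = FALSE keeps single-column matches as data frames
sums <- lapply(names(patterns), function(k) rowSums(dtf[, matched[, k], drop = FALSE]))
names(sums) <- names(patterns)

# Everything that matched no pattern goes into General
result <- data.frame(sums,
                     General = rowSums(dtf[, rowSums(matched) == 0, drop = FALSE]))
result
```

Adding a category then only means adding one entry to `patterns`, provided the patterns do not overlap.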

It is usually better to store your data in long format. I would therefore approach your problem as follows:

1 - get your data in long format

library(reshape2)
dfl <- melt(df)

2 - create 'engine' and 'pipeline' vectors

e_vec <- c("Engine","engine.phe","aux.engine","engine.mpd")
p_vec <- c("pipeline.maintain","pipeline","pipe.line")

3 - create a category column

dfl$newcat <- c("general","engine","pipeline")[1 + dfl$variable %in% e_vec + 2*(dfl$variable %in% p_vec)]
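The indexing in the line above is compact, so here is a small sketch of how it resolves. The `variable` vector is a made-up sample of column names; the rest mirrors the vectors defined in step 2.

```r
variable <- c("Engine", "General", "pipe.line", "aux.engine")  # sample names only
e_vec <- c("Engine", "engine.phe", "aux.engine", "engine.mpd")
p_vec <- c("pipeline.maintain", "pipeline", "pipe.line")

# %in% binds tighter than + and *, and TRUE/FALSE count as 1/0, so idx is
# 1 (in neither vector), 2 (in e_vec) or 3 (in p_vec)
idx <- 1 + variable %in% e_vec + 2 * (variable %in% p_vec)

# idx then picks the matching label from the lookup vector
newcat <- c("general", "engine", "pipeline")[idx]
data.frame(variable, idx, newcat)
```

This relies on `e_vec` and `p_vec` having no names in common; a name present in both would give index 4 and therefore `NA`.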

The result:

> dfl
            variable value   newcat
1             Engine     1   engine
2            General    12  general
3       Ladder.winch    22  general
4         engine.phe     2   engine
5      subm.gear.box     4  general
6         aux.engine     2   engine
7  pipeline.maintain     4 pipeline
8           pipeline     5 pipeline
9          pipe.line     6 pipeline
10        engine.mpd     7   engine

Now you can use aggregate to get the final result:

> aggregate(value ~ newcat, dfl, sum)
    newcat value
1   engine    12
2  general    38
3 pipeline    15
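The output above is for a data frame with a single row. Melting without an id variable does not keep track of which row a value came from, so with several rows the same `aggregate` call would give totals over all rows. A possible way around this, sketched here in base R with a hypothetical `id` column and the three-row example data from the first answer, is to carry a row identifier into the long format and group by it as well:

```r
dtf <- data.frame(
  Engine = c(1, 0, 1), General = c(12, 3, 15), Ladder.winch = c(22, 28, 26),
  engine.phe = c(2, 1, 0), subm.gear.box = c(4, 4, 10), aux.engine = c(2, 3, 1),
  pipeline.maintain = c(4, 5, 1), pipeline = c(5, 5, 2), pipe.line = c(6, 8, 2),
  engine.mpd = c(7, 8, 19)
)

# Long format with an explicit row id (unlist() walks the data column by column)
long <- data.frame(
  id       = rep(seq_len(nrow(dtf)), times = ncol(dtf)),
  variable = rep(names(dtf), each = nrow(dtf)),
  value    = unlist(dtf, use.names = FALSE)
)

e_vec <- c("Engine", "engine.phe", "aux.engine", "engine.mpd")
p_vec <- c("pipeline.maintain", "pipeline", "pipe.line")
long$newcat <- c("general", "engine", "pipeline")[
  1 + long$variable %in% e_vec + 2 * (long$variable %in% p_vec)]

# One sum per original row and category (long result)
agg <- aggregate(value ~ id + newcat, data = long, FUN = sum)

# The same sums laid out as one row per id and one column per category
wide <- xtabs(value ~ id + newcat, data = long)
wide
```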

Here is an option that extracts the relevant words from the column names and uses tapply to get the sum. str_extract_all returns a list ('lst'). Replace the elements that have zero length with 'GENERAL'. Then, with the group-by function tapply, unlist the dataset and use the grouping variables, i.e. the replicated 'lst' and the row of 'df1', to get the sum.

library(stringr)
lst <- str_extract_all(toupper(sub("(pipe)\\.", "\\1", names(df1))),
          "ENGINE|PIPELINE|GENERAL")
lst[lengths(lst)==0] <- "GENERAL"
t(tapply(unlist(df1), list(unlist(lst)[col(df1)], row(df1)), FUN = sum))
#   ENGINE  GENERAL PIPELINE 
#1      12       38       15 

# Label each column by its name, then sum the columns of each group row by row
myfactors <- ifelse(grepl("engine", names(df), ignore.case = TRUE), "Engine",
                    ifelse(grepl("pipe", names(df), ignore.case = TRUE), "Pipeline",
                           "General"))
data.frame(lapply(split.default(df, myfactors), rowSums))
#  Engine General Pipeline
#1     12      38       15
#2     12      35       18
#3     21      51        5

Here df is the example data frame from the first answer above (named dtf there).
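To see what the grouping does, the sketch below inspects the pieces returned by `split.default` and then computes the same totals with base R's `rowsum` on the transposed matrix. The `rowsum` route is an alternative suggested here, not part of the answer above, and it assumes all columns are numeric.

```r
df <- data.frame(
  Engine = c(1, 0, 1), General = c(12, 3, 15), Ladder.winch = c(22, 28, 26),
  engine.phe = c(2, 1, 0), subm.gear.box = c(4, 4, 10), aux.engine = c(2, 3, 1),
  pipeline.maintain = c(4, 5, 1), pipeline = c(5, 5, 2), pipe.line = c(6, 8, 2),
  engine.mpd = c(7, 8, 19)
)

groups <- ifelse(grepl("engine", names(df), ignore.case = TRUE), "Engine",
                 ifelse(grepl("pipe", names(df), ignore.case = TRUE), "Pipeline",
                        "General"))

# split.default treats the data frame as a list of columns,
# so each piece is a data frame holding the columns of one group
parts <- split.default(df, groups)
sapply(parts, ncol)

# Alternative: rowsum() adds up matrix rows by group, so transpose first
# to group columns, then transpose back
totals <- t(rowsum(t(as.matrix(df)), groups))
totals
```

`totals` is a numeric matrix rather than a data frame; wrap it in `as.data.frame()` if a data frame is needed.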
