How can I summarize statistics in a loop in R?

I have a dataset containing about 60 variables (A, B, C, D, ...), each with 3 corresponding information columns (A, Group_A and WOE_A), as in the table below:

ID   A   Group_A  WOE_A  B  Group_B  WOE_B  C  Group_C  WOE_C  D      Group_D  WOE_D  Status
213  0   1        0.87   0  1        0.65   0  1        0.80   915.7  4        -0.30  1
321  12  5        0.08   4  4        -0.43  6  5        -0.20  85.3   2        0.26   0
32   0   1        0.87   0  1        0.65   0  1        0.80   28.6   2        0.26   1
13   7   4        -0.69  2  3        -0.82  4  4        -0.80  31.8   2        0.26   0
43   1   2        -0.04  1  2        -0.49  1  2        -0.22  51.7   2        0.26   0
656  2   3        -0.28  2  3        -0.82  2  3        -0.65  8.5    1        1.14   0
435  2   3        -0.28  0  1        0.65   0  1        0.80   39.8   2        0.26   0
65   8   4        -0.69  3  4        -0.43  5  4        -0.80  243.0  3        0.00   0
565  0   1        0.87   0  1        0.65   0  1        0.80   4.0    1        1.14   0
432  0   1        0.87   0  1        0.65   0  1        0.80   81.6   2        0.26   0

I want to print a table in R with some statistics (Min(A), Max(A), WOE_A, Count(Group_A), Count(Group_A, where Status=1), Count(Group_A, where Status=0)), all grouped by Group, for each of the 60 variables, and I think I need to do it in a loop. I tried the "dplyr" package, but I don't know how to refer to all three columns (A, Group_A and WOE_A) that relate to a variable (A), nor how to summarize all the desired statistics.

The code I began with is:

df <- data
List <- list(df)
for (colname in colnames(df)) {
  List[[colname]]<- df %>%
    group_by(df[,colname]) %>%
    count()
}
List

This is how I want to print the results:

Var A
Group   Min(A)  Max(A)  WOE_A   Count(Group_A)  Count_1(Group_A, where Status=1)  Count_0(Group_A, where Status=0)
1
2
3
4
5

Thank you very much!

Laura

Laura, as mentioned by the others, working with "long" data frames is better than working with wide data frames.

Your initial idea of using dplyr and group_by() got you almost there. Note: this is also a way to break down your data and then combine it with generic columns, if the wide-to-long conversion is pushing the limits.

Let's start with this:

library(dplyr)

#---------- extract all "A" measurements
df %>% 
   select(A, Group_A, WOE_A, Status) %>% 
#---------- grouped summary of multiple stats
   group_by(A) %>% 
   summarise(
       Min = min(A)
    ,  Max = max(A)
    ,  WOE_A = unique(WOE_A) 
    ,   Count = n()    # n() is a helper function of dplyr
    ,  CountStatus1 = sum(Status == 1)  # use sum() to count logical conditions
    ,  CountStatus0 = sum(Status == 0)
)

This yields:

# A tibble: 6 x 7
      A   Min   Max WOE_A Count CountStatus1 CountStatus0
  <dbl> <dbl> <dbl> <dbl> <int>        <int>        <int>
1     0     0     0  0.87     4            2            2
2     1     1     1 -0.04     1            0            1
3     2     2     2 -0.28     2            0            2
4     7     7     7 -0.69     1            0            1
5     8     8     8 -0.69     1            0            1
6    12    12    12  0.08     1            0            1
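
The summary above groups by the raw value A, so Min and Max are identical in every row. The question asks for statistics per Group_A, so here is a minimal sketch of that variant. It uses only the A-related columns, retyped from the question's sample rows. The name df_a and the use of first() (which assumes WOE_A is constant within each group) are choices made for this sketch, not part of the original answer.

```r
library(dplyr)

# Columns for variable A only, copied from the sample rows in the question
df_a <- data.frame(
  A       = c(0, 12, 0, 7, 1, 2, 2, 8, 0, 0),
  Group_A = c(1, 5, 1, 4, 2, 3, 3, 4, 1, 1),
  WOE_A   = c(0.87, 0.08, 0.87, -0.69, -0.04, -0.28, -0.28, -0.69, 0.87, 0.87),
  Status  = c(1, 0, 1, 0, 0, 0, 0, 0, 0, 0)
)

by_group <- df_a %>%
  group_by(Group_A) %>%            # group by the bin, not by the raw value
  summarise(
    Min = min(A),
    Max = max(A),
    WOE_A = first(WOE_A),          # assumption: one WOE value per group
    Count = n(),
    CountStatus1 = sum(Status == 1),
    CountStatus0 = sum(Status == 0)
  )

print(by_group)
```

With the sample rows, group 4 spans the A values 7 to 8, which the value-based grouping cannot show.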

OK. Turning your wide data frame into a long one is not trivial, because the column names encode both the kind of measurement and the variable name. On top of that, ID and Status are id/key variables for each row.

The standard tool to convert wide to long is tidyr's pivot_longer(). Read up on this. In your particular case we want to push multiple columns into multiple targets. For this you need to get a feel for the .value sentinel. The pivot_longer() help pages might be useful for studying this case.

To ease the pain of constructing a complex regex to decode the variable names, I rename your group-id-label columns, e.g. A, B, to X_A, X_B. This ensures that all column names are built in the form what_letter!

library(tidyr)

    df %>% 
    # ----------- prepare variable names to be well-formed, you may do this upstream
      rename(X_A = A, X_B = B, X_C = C, X_D = D) %>%
     
    # ----------- call pivot longer with .value sentinel and names_pattern
    # ----------- that is an advanced use of the capabilities
      pivot_longer(
          cols = -c("ID","Status")         # apply to all cols besides ID and Status
       , names_to = c(".value", "label")   # target column names are based on origin names
                                           # and an individual label (think id, name as u like)
       , names_pattern = "(.*)(.*_[A-D]{1})$")  # regex for the origin column patterns
                                                # pattern is built of 2 parts (...)(...)
                                                # (.*) no or any symbol possibly multiple times
                                                # (.*_[A-D]{1}) as above, but ending with underscore and 1 letter 

This gives you:

# A tibble: 40 x 6
      ID Status label     X Group   WOE
   <dbl>  <dbl> <chr> <dbl> <dbl> <dbl>
 1   213      1 _A      0       1  0.87
 2   213      1 _B      0       1  0.65
 3   213      1 _C      0       1  0.8 
 4   213      1 _D    916.      4 -0.3 
 5   321      0 _A     12       5  0.08
 6   321      0 _B      4       4 -0.43
 7   321      0 _C      6       5 -0.2 
 8   321      0 _D     85.3     2  0.26
 9    32      1 _A      0       1  0.87
10    32      1 _B      0       1  0.65

Putting it all together:

df %>% 
# ------------ prepare and make long
   rename(X_A = A, X_B = B, X_C = C, X_D = D) %>% 
   pivot_longer(cols = -c("ID","Status")
               , names_to = c(".value", "label")
               , names_pattern = "(.*)(.*_[A-D]{1})$") %>% 

# ------------- calculate stats on groups
  group_by(label, X) %>% 
  summarise(Min = min(X),  Max = max(X),  WOE = unique(WOE)
           ,Count = n(),  CountStatus1 = sum(Status == 1)
           , CountStatus0 = sum(Status == 0)
)

Voilà:

# A tibble: 27 x 8
# Groups:   label [4]
   label     X   Min   Max   WOE Count CountStatus1 CountStatus0
   <chr> <dbl> <dbl> <dbl> <dbl> <int>        <int>        <int>
 1 _A        0     0     0  0.87     4            2            2
 2 _A        1     1     1 -0.04     1            0            1
 3 _A        2     2     2 -0.28     2            0            2
 4 _A        7     7     7 -0.69     1            0            1
 5 _A        8     8     8 -0.69     1            0            1
 6 _A       12    12    12  0.08     1            0            1
 7 _B        0     0     0  0.65     5            2            3
 8 _B        1     1     1 -0.49     1            0            1
 9 _B        2     2     2 -0.82     2            0            2
10 _B        3     3     3 -0.43     1            0            1
# ... with 17 more rows
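
With about 60 variables, renaming each column by hand and listing letters in the regex gets tedious. The sketch below shows one way the same pipeline might be generalised, and it groups by the Group column as the question requests. The vector vars, the stricter names_pattern, the use of first() for WOE, and the final split() into one table per variable are assumptions of this sketch rather than part of the answer above.

```r
library(dplyr)
library(tidyr)

# Sample rows from the question
df <- read.table(header = TRUE, text = "
ID   A   Group_A  WOE_A  B  Group_B  WOE_B  C  Group_C  WOE_C  D      Group_D  WOE_D  Status
213  0   1        0.87   0  1        0.65   0  1        0.80   915.7  4        -0.30  1
321  12  5        0.08   4  4        -0.43  6  5        -0.20  85.3   2        0.26   0
32   0   1        0.87   0  1        0.65   0  1        0.80   28.6   2        0.26   1
13   7   4        -0.69  2  3        -0.82  4  4        -0.80  31.8   2        0.26   0
43   1   2        -0.04  1  2        -0.49  1  2        -0.22  51.7   2        0.26   0
656  2   3        -0.28  2  3        -0.82  2  3        -0.65  8.5    1        1.14   0
435  2   3        -0.28  0  1        0.65   0  1        0.80   39.8   2        0.26   0
65   8   4        -0.69  3  4        -0.43  5  4        -0.80  243.0  3        0.00   0
565  0   1        0.87   0  1        0.65   0  1        0.80   4.0    1        1.14   0
432  0   1        0.87   0  1        0.65   0  1        0.80   81.6   2        0.26   0
")

# Hypothetical: the names of all value columns (about 60 in the real data)
vars <- c("A", "B", "C", "D")

long <- df %>%
  # prefix every value column with "X_" so all names read <what>_<variable>
  rename_with(~ paste0("X_", .x), .cols = all_of(vars)) %>%
  pivot_longer(
    cols = -c(ID, Status),
    names_to = c(".value", "label"),
    names_pattern = "^(X|Group|WOE)_(.+)$"   # part 1 -> column, part 2 -> label
  )

stats <- long %>%
  group_by(label, Group) %>%
  summarise(
    Min = min(X, na.rm = TRUE),
    Max = max(X, na.rm = TRUE),
    WOE = first(WOE),                        # assumption: one WOE per group
    Count = n(),
    CountStatus1 = sum(Status == 1),
    CountStatus0 = sum(Status == 0),
    .groups = "drop"
  )

# One summary table per variable, e.g. tables[["A"]]
tables <- split(stats, stats$label)
print(tables[["A"]])
```

Splitting by label gives one table per variable, which is close to the "Var A" layout sketched in the question, without writing an explicit loop.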

The loop that I managed to write is shown below. Apart from the tables that I wanted to list, I also needed to make a chart showing some of the information from each table, and then print a PDF with each variable and its corresponding table and chart on a different page.

    library(dplyr)
    library(ggplot2)

    data <- as.data.frame(data)

    # Column 5 holds the first piece of information related to a variable.
    # Each variable has 3 consecutive columns (Value, Group, WOE);
    # ID is column 1 and Status is column 224.
    for (i in seq(5, 223, 3)) {
        # Build a small data frame with the columns related to this variable
        df <- data.frame(
            ID     = data[, 1],
            A      = data[, i],
            Group  = data[, i + 1],
            WOE    = data[, i + 2],
            Status = data[, 224]
        )

        # Build the summary table with its corresponding statistics
        # (named stats_tbl rather than T, which would mask the shorthand for TRUE)
        stats_tbl <- df %>%
            select(A, Group, WOE, Status) %>%
            group_by(Group) %>%
            summarise(
                Min = min(A, na.rm = TRUE), Max = max(A, na.rm = TRUE), WOE = unique(WOE),
                Count = n(),
                CountStatus1 = sum(Status == 1),
                CountStatus0 = sum(Status == 0),
                BadRate = round((CountStatus1 / Count) * 100, 1))
        print(colnames(data)[i])
        print(stats_tbl)

        # Then I plot some information from the summary table.
        # WOE is multiplied by 1000 to share the count axis; the secondary axis divides it back.
        p <- ggplot(stats_tbl) +
            geom_col(aes(x = Group, y = CountStatus1), size = 1, color = "darkgreen", fill = "darkgreen") +
            geom_line(aes(x = Group, y = WOE * 1000), col = "firebrick", size = 0.9) +
            geom_point(aes(x = Group, y = WOE * 1000), col = "gray", size = 3) +
            ggtitle(label = paste("WOE and Event Count by Group", " - ", colnames(data)[i])) +
            labs(x = "Group", y = "Event Count", size = 7) +
            theme(plot.title = element_text(size = 8, face = "bold", margin = margin(10, 0, 10, 0)),
                  axis.text.x = element_text(angle = 0, hjust = 1)) +
            scale_y_continuous(sec.axis = sec_axis(trans = ~ . / 1000, name = "WOE", breaks = seq(-3, 5, 0.5)))
        print(p)
    }

The information is printed for all the variables that I need, as in the pictures below:

Table for one of the variables

Chart for the same variable

However, I now have a problem exporting the results to a PDF. I do not know how to print each table and chart on a distinct page of a PDF.
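
One possible direction, offered as a rough sketch rather than a tested solution for the full data set: open a single pdf() device before the loop, draw the table as monospaced text on one page and the chart on the next, and close the device afterwards. The list tables, the file name, and the simplified chart are hypothetical stand-ins for the objects built in the loop above. Here each variable gets two pages. Packages such as gridExtra can place a table and a chart on the same page if that is preferred.

```r
library(ggplot2)

# Hypothetical stand-in for the per-variable summary tables built in the loop
tables <- list(
  A = data.frame(Group = 1:5,
                 WOE = c(0.87, -0.04, -0.28, -0.69, 0.08),
                 Count = c(4, 1, 2, 2, 1),
                 CountStatus1 = c(2, 0, 0, 0, 0))
)

out_file <- file.path(tempdir(), "woe_report.pdf")  # hypothetical output path
pdf(out_file, width = 8, height = 6)                # one multi-page PDF file

for (var in names(tables)) {
  tbl <- tables[[var]]

  # Page with the table: capture the printed table and draw it as text
  plot.new()
  title(main = paste("Var", var))
  text(0, 1, paste(capture.output(print(tbl)), collapse = "\n"),
       adj = c(0, 1), family = "mono", cex = 0.8)

  # Page with the chart: printing a ggplot starts a new page on the device
  p <- ggplot(tbl) +
    geom_col(aes(x = Group, y = CountStatus1), fill = "darkgreen") +
    ggtitle(paste("Event Count by Group", "-", var))
  print(p)
}

dev.off()                                           # close the file
```

Inside a loop the explicit print(p) matters, because ggplot objects are not drawn automatically there.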
