I have a dataset containing about 60 variables (A, B, C, D, ...), each with 3 corresponding information columns (A, Group_A and WOE_A) as in the list below:
ID A Group_A WOE_A B Group_B WOE_B C Group_C WOE_C D Group_D WOE_D Status
213 0 1 0.87 0 1 0.65 0 1 0.80 915.7 4 -0.30 1
321 12 5 0.08 4 4 -0.43 6 5 -0.20 85.3 2 0.26 0
32 0 1 0.87 0 1 0.65 0 1 0.80 28.6 2 0.26 1
13 7 4 -0.69 2 3 -0.82 4 4 -0.80 31.8 2 0.26 0
43 1 2 -0.04 1 2 -0.49 1 2 -0.22 51.7 2 0.26 0
656 2 3 -0.28 2 3 -0.82 2 3 -0.65 8.5 1 1.14 0
435 2 3 -0.28 0 1 0.65 0 1 0.80 39.8 2 0.26 0
65 8 4 -0.69 3 4 -0.43 5 4 -0.80 243.0 3 0.00 0
565 0 1 0.87 0 1 0.65 0 1 0.80 4.0 1 1.14 0
432 0 1 0.87 0 1 0.65 0 1 0.80 81.6 2 0.26 0
I want to print a table in R with some statistics (Min(A), Max(A), WOE_A, Count(Group_A), Count(Group_A where Status=1), Count(Group_A where Status=0)), all grouped by Group, for each of the 60 variables. I think I need to do this in a loop. I tried the "dplyr" package, but I don't know how to refer to all three columns (A, Group_A and WOE_A) that relate to a variable (A), or how to summarize the data for all the desired statistics.
The code I began with is:
df <- data
List <- list(df)
for (colname in colnames(df)) {
List[[colname]]<- df %>%
group_by(df[,colname]) %>%
count()
}
List
This is how I want to print results:
Var A
Group Min(A) Max(A) WOE_A Count(Group_A) Count_1(Group_A, where Status=1) Count_0(Group_A, where Status=0)
1
2
3
4
5
Thank you very much!
Laura
Laura, as mentioned by the others, working with "long" data frames is easier than working with wide ones.
Your initial idea of using dplyr and group_by() got you almost there. Note: selecting one variable's columns at a time, as done here, is also a way to break down your data and then combine the pieces under generic column names, if the wide-to-long conversion is pushing the limits.
Let's start with this:
library(dplyr)
#---------- extract all "A" measurements
df %>%
select(A, Group_A, WOE_A, Status) %>%
#---------- grouped summary of multiple stats
group_by(A) %>%
summarise(
Min = min(A)
, Max = max(A)
, WOE_A = unique(WOE_A)
, Count = n() # n() is a helper function of dplyr
, CountStatus1 = sum(Status == 1) # use sum() to count logical conditions
, CountStatus0 = sum(Status == 0)
)
This yields:
# A tibble: 6 x 7
A Min Max WOE_A Count CountStatus1 CountStatus0
<dbl> <dbl> <dbl> <dbl> <int> <int> <int>
1 0 0 0 0.87 4 2 2
2 1 1 1 -0.04 1 0 1
3 2 2 2 -0.28 2 0 2
4 7 7 7 -0.69 1 0 1
5 8 8 8 -0.69 1 0 1
6 12 12 12 0.08 1 0 1
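Note that the summary above groups by the raw value A, so Min and Max always coincide. The question asks for one row per Group_A instead. A minimal sketch of that variant follows, using the ten sample rows for variable A typed in by hand; df_demo and by_group are illustrative names, and first(WOE_A) assumes the WOE value is constant within a group.

```r
library(dplyr)

# Sample rows from the question, variable A only (typed in by hand)
df_demo <- data.frame(
  ID      = c(213, 321, 32, 13, 43, 656, 435, 65, 565, 432),
  A       = c(0, 12, 0, 7, 1, 2, 2, 8, 0, 0),
  Group_A = c(1, 5, 1, 4, 2, 3, 3, 4, 1, 1),
  WOE_A   = c(0.87, 0.08, 0.87, -0.69, -0.04, -0.28, -0.28, -0.69, 0.87, 0.87),
  Status  = c(1, 0, 1, 0, 0, 0, 0, 0, 0, 0)
)

by_group <- df_demo %>%
  group_by(Group_A) %>%              # one row per group, not per raw value
  summarise(
    Min = min(A),
    Max = max(A),
    WOE_A = first(WOE_A),            # assumption: WOE is constant within a group
    Count = n(),
    CountStatus1 = sum(Status == 1),
    CountStatus0 = sum(Status == 0)
  )
by_group
```

With this grouping, group 4 spans the values 7 to 8, which is the kind of range the Min and Max columns are meant to show.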
OK. Turning your wide data frame into a long one is not trivial, because the column names encode both the measurement type and the variable name. On top of that, ID and Status are id/key variables for each row.
The standard tool to convert wide to long is tidyr's pivot_longer(). Read up on this. In your particular case we want to push multiple columns into multiple targets. For this you need to get a feel for the .value sentinel. The pivot_longer() help pages might be useful for studying this case.
To ease the pain of constructing a complex regular expression to decode the variable names, I rename your value columns, e.g. A, B, to X_A, X_B. This ensures that all column names are built in the form what_letter!
library(tidyr)
df %>%
# ----------- prepare variable names to be well-formed, you may do this upstream
rename(X_A = A, X_B = B, X_C = C, X_D = D) %>%
# ----------- call pivot longer with .value sentinel and names_pattern
# ----------- that is an advanced use of the capabilities
pivot_longer(
cols = -c("ID","Status") # apply to all cols besides ID and Status
, names_to = c(".value", "label") # target column names are based on origin names
# and an individual label (think id, name as u like)
, names_pattern = "(.*)(.*_[A-D]{1})$") # regex for the origin column patterns
# pattern is built of 2 parts (...)(...)
# (.*) any characters, zero or more times
# (.*_[A-D]{1}) as above, but ending with an underscore and one letter from A to D
This gives you
# A tibble: 40 x 6
ID Status label X Group WOE
<dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 213 1 _A 0 1 0.87
2 213 1 _B 0 1 0.65
3 213 1 _C 0 1 0.8
4 213 1 _D 916. 4 -0.3
5 321 0 _A 12 5 0.08
6 321 0 _B 4 4 -0.43
7 321 0 _C 6 5 -0.2
8 321 0 _D 85.3 2 0.26
9 32 1 _A 0 1 0.87
10 32 1 _B 0 1 0.65
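If the leading underscore in label is unwanted, or renaming 60 columns by hand is impractical, a variation along these lines may help. This is only a sketch under assumptions: it uses names_sep = "_" instead of the regex, and rename_with() to prefix every value column, which presumes that the raw value columns are exactly those without an underscore in their name. df_demo and long_demo are illustrative names.

```r
library(dplyr)
library(tidyr)

# Two sample rows from the question, variables A and B only (typed in by hand)
df_demo <- data.frame(
  ID = c(213, 321), Status = c(1, 0),
  A = c(0, 12), Group_A = c(1, 5), WOE_A = c(0.87, 0.08),
  B = c(0, 4),  Group_B = c(1, 4), WOE_B = c(0.65, -0.43)
)

long_demo <- df_demo %>%
  # assumption: value columns are the ones without "_" (A, B, ...)
  rename_with(~ paste0("X_", .x), .cols = !c(ID, Status) & !contains("_")) %>%
  pivot_longer(
    cols = -c(ID, Status),
    names_to = c(".value", "label"),
    names_sep = "_"          # "Group_A" splits into "Group" and "A"
  )
long_demo
```

Here label holds the bare variable name ("A", "B"). The approach breaks if a variable name itself contains an underscore, in which case a names_pattern regex is the safer choice.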
Putting it all together:
df %>%
# ------------ prepare and make long
rename(X_A = A, X_B = B, X_C = C, X_D = D) %>%
pivot_longer(cols = -c("ID","Status")
, names_to = c(".value", "label")
, names_pattern = "(.*)(.*_[A-D]{1})$") %>%
# ------------- calculate stats on groups
group_by(label, X) %>%
summarise(Min = min(X), Max = max(X), WOE = unique(WOE)
,Count = n(), CountStatus1 = sum(Status == 1)
, CountStatus0 = sum(Status == 0)
)
Voila:
# A tibble: 27 x 8
# Groups: label [4]
label X Min Max WOE Count CountStatus1 CountStatus0
<chr> <dbl> <dbl> <dbl> <dbl> <int> <int> <int>
1 _A 0 0 0 0.87 4 2 2
2 _A 1 1 1 -0.04 1 0 1
3 _A 2 2 2 -0.28 2 0 2
4 _A 7 7 7 -0.69 1 0 1
5 _A 8 8 8 -0.69 1 0 1
6 _A 12 12 12 0.08 1 0 1
7 _B 0 0 0 0.65 5 2 3
8 _B 1 1 1 -0.49 1 0 1
9 _B 2 2 2 -0.82 2 0 2
10 _B 3 3 3 -0.43 1 0 1
# ... with 17 more rows
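To get closer to the layout requested in the question (one table per variable, one row per Group), the same pipeline can be grouped by label and Group and then split by label. A minimal sketch on four sample rows for variables A and D; df_demo, stats and tables are illustrative names, and first(WOE) assumes one WOE value per group.

```r
library(dplyr)
library(tidyr)

# Four sample rows from the question, variables A and D only (typed in by hand)
df_demo <- data.frame(
  ID = c(213, 321, 32, 13), Status = c(1, 0, 1, 0),
  A = c(0, 12, 0, 7), Group_A = c(1, 5, 1, 4),
  WOE_A = c(0.87, 0.08, 0.87, -0.69),
  D = c(915.7, 85.3, 28.6, 31.8), Group_D = c(4, 2, 2, 2),
  WOE_D = c(-0.30, 0.26, 0.26, 0.26)
)

stats <- df_demo %>%
  rename(X_A = A, X_D = D) %>%
  pivot_longer(cols = -c(ID, Status),
               names_to = c(".value", "label"),
               names_sep = "_") %>%
  group_by(label, Group) %>%           # group by the bin, not the raw value
  summarise(Min = min(X), Max = max(X),
            WOE = first(WOE),          # assumption: WOE is constant per group
            Count = n(),
            CountStatus1 = sum(Status == 1),
            CountStatus0 = sum(Status == 0),
            .groups = "drop")

# One table per variable, named by label ("A", "D")
tables <- split(stats, stats$label)
tables$D
```

Each element of tables can then be printed or plotted on its own, which avoids indexing columns by position.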
The loop I managed to write is shown below. Apart from the tables I wanted to list, I also needed a chart showing some of the information from each table, and then a PDF with each variable and its corresponding table and chart on a separate page.
data <- as.data.frame(data)
# Column 5 holds the first variable's values, so for each variable I build a data frame from its related columns
i <- 5
#each variable has 3 columns (Value, Group, WOE)
for (i in seq(5, 223, 3)){
ID <- data[,1]
A <- data[,i]
Group <- data[,i+1]
WOE <- data[,i+2]
Status <- data[,224]
df <- cbind(ID, A, Group, WOE, Status)
df <- data.frame(df)
# Build table T with its corresponding statistics
T <- df %>%
select(A, Group, WOE, Status) %>%
group_by(Group) %>%
summarise(
Min = min(A, na.rm=TRUE), Max = max(A, na.rm=TRUE), WOE = unique(WOE),
Count = n(),
CountStatus1 = sum(Status == 1),
CountStatus0 = sum(Status == 0),
BadRate = round((CountStatus1/Count)*100,1))
print(colnames(data)[i])
print(T)
# Then I plot some information from Table T
p <- ggplot(T) + geom_col(aes(x=Group, y=CountStatus1), size = 1, color = "darkgreen", fill = "darkgreen")
p <- p + geom_line(aes(x=Group, y=WOE*1000), col="firebrick", size=0.9) +
geom_point(aes(x=Group, y=WOE*1000), col="gray", size=3) +
ggtitle(label = paste("WOE and Event Count by Group", " - " , colnames(data)[i])) +
labs(x = "Group", y = "Event Count", size=7) +
theme(plot.title = element_text(size=8, face="bold", margin = margin(10, 0, 10, 0)),
axis.text.x = element_text(angle=0, hjust = 1)) +
scale_y_continuous(sec.axis = sec_axis(trans = ~ . /1000, name="WOE", breaks = seq(-3, 5, 0.5)))
print(p)
}
The information is printed for all the variables that I need as in the pictures below:
Table for one of the variables
However, I am now having trouble exporting the results to a PDF. I do not know how to print each table and chart on a separate page of the PDF.
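One possible approach, offered only as a sketch: open a pdf() device before the loop, draw the table and chart for each variable with gridExtra's grid.arrange(), and close the device afterwards. grid.arrange() starts a new page on every call by default, so each variable lands on its own page. The tables list, the simplified chart and the output path below are hypothetical stand-ins for the objects built in the loop above, and the gridExtra package is an assumption not used in the original code.

```r
library(ggplot2)
library(gridExtra)

# Hypothetical stand-in for the per-variable summary tables from the loop
tables <- list(
  A = data.frame(Group = 1:3, WOE = c(0.87, -0.04, -0.28),
                 CountStatus1 = c(2, 0, 0)),
  B = data.frame(Group = 1:3, WOE = c(0.65, -0.49, -0.82),
                 CountStatus1 = c(2, 0, 0))
)

out_file <- file.path(tempdir(), "woe_report.pdf")  # illustrative path
pdf(out_file, width = 8, height = 10)
for (v in names(tables)) {
  tab <- tables[[v]]
  p <- ggplot(tab, aes(x = Group, y = CountStatus1)) +
    geom_col(fill = "darkgreen") +
    ggtitle(paste("Event Count by Group -", v))
  # Table on top, chart below; each call opens a new PDF page
  grid.arrange(tableGrob(tab, rows = NULL), p, nrow = 2, top = v)
}
dev.off()
```

To put the table and the chart on two separate pages instead, the table and the plot could be passed to two separate grid.arrange() calls inside the loop.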