I have recently started working with R. I have a dataset which is composed of two columns and 100000 rows as shown below:
Y TOTA
1 1 403500.000
2 1 188334.000
3 0 812387.000
4 0 163626.000
5 1 49527.000
6 1 48661.000
7 0 36712.000
8 1 31745.000
9 1 23342.000
10 0 46835.000
...... . .........
100000 0 10.982
The variable Y can have just two values: 0 or 1, whereas the variable TOTA can have various values. The function summary gives me the following result:
Y TOTA
Min. :0.0000 Min. : 0
1st Qu.:0.0000 1st Qu.: 939
Median :1.0000 Median : 3918
Mean :0.5113 Mean : 40245
3rd Qu.:1.0000 3rd Qu.: 11028
Max. :1.0000 Max. :18938000
NA's :261
AIM:
I would like to create a table with 10 rows and 3 columns. Each row represents a different range of the variable TOTA (ideally a decile of my dataset), and the last row counts the NAs. I would then like to populate the table from the dataset: for each row of the dataset, find the table row whose range contains the TOTA value. If Y is 1, add 1 to the column "Number Active Companies" in that row; if Y is 0, add 1 to the column "Number Passive Companies" in that row.
WHAT I HAVE ATTEMPTED
What I have tried so far is to create a table which will contain the result of my dataset processing
Number Active Companies Number Passive Companies Total
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 0
10 0 0 0
result<-matrix(data = 0, nrow = 10, ncol = 3, byrow = FALSE, dimnames = list(1:10,c("Number Active Companies","Number Passive Companies","Total")));
Afterwards I created 10 groups, each covering a different range of my variable:
x > 0 && x < 100
x > 100 && x < 1000
x > 1000 && x < 10000
x > 10000 && x < 100000
x > 100000 && x < 1000000
x > 1000000 && x < 1000000
x > 5938000 && x < 10938000
x > 10938000 && x < 15938000
x > 15938000 && x < 18938000
x=NA
Now I would like to populate the previous table in this way: I want to go through each row of the dataset, and if Y is 1, add 1 to the column "Number Active Companies" in the row whose range the TOTA value belongs to, and do the same for "Number Passive Companies" when Y is 0.
for(i in TOTA){
if (Y=1)
if(x > 0 && x < 100){
}else if(x > 100 && x < 1000){
}else if(x > 1000 && x < 10000){
}else if(x > 10000 && x < 100000){
}else if(x > 100000 && x < 1000000){
}else if( x > 1000000 && x < 1000000){
}else if( x > 1000000 && x < 1000000){
}else if( x > 5938000 && x < 10938000){
}else if( x > 10938000 && x < 15938000){
}else if( x > 15938000 && x < 18938000) {
}else{
//Nas
}
}else if(Y=0){
if(x > 0 && x < 100){
}else if(x > 100 && x < 1000){
}else if(x > 1000 && x < 10000){
}else if(x > 10000 && x < 100000){
}else if(x > 100000 && x < 1000000){
}else if( x > 1000000 && x < 1000000){
}else if( x > 1000000 && x < 1000000){
}else if( x > 5938000 && x < 10938000){
}else if( x > 10938000 && x < 15938000){
}else if( x > 15938000 && x < 18938000) {
}else{
//Nas
}
}
QUESTIONS
How can I write to the table? How can I do this process in an easier manner? How can I create a histogram of this table?
I am wondering whether I am doing the right thing, given that I have read the manual for the functions quantile() and percentile() and it seems they do the same thing.
Can you please give me some guidelines and possibly some commands to achieve my aim?
Thank you
Still difficult to figure out what you are trying to accomplish, but this is my best guess:
# create reproducible example - you already have this...
set.seed(1)
df <- data.frame(Y=sample(0:1,100000,replace=T),
TOTA=runif(100000,0,18938000))
na <- sample(1:100000,5000) # 5% NA
df[na,]$TOTA <- NA
# you start here...
breaks <- c(0,10^(2:6), 5938000, 10938000, 15938000, 18938000)
# one label per interval between consecutive breaks, plus one for NA
labels <- c("0-100","100-1000","1000-10000","10000-100000",
"100000-1000000","1000000-5938000","5938000-10938000",
"10938000-15938000","15938000-18938000","NA")
df$group <- cut(df$TOTA,breaks=breaks,labels=F)
df[is.na(df$group),]$group <- 10
df$grpLabel <- labels[df$group]
result <- aggregate(Y~group,df,function(x)sum(x==1))
colnames(result) <- c("Group","Active")
result$Passive <- aggregate(Y~group,df,function(x)sum(x==0))$Y
result$Group <- labels[result$Group]
result
# Group Active Passive
# 1 0-100 0 1
# 2 100-1000 1 2
# 3 1000-10000 29 17
# 4 10000-100000 224 212
# 5 100000-1000000 2310 2288
# 6 1000000-5938000 12365 12328
# 7 5938000-10938000 12508 12522
# 8 10938000-15938000 12526 12649
# 9 15938000-18938000 7485 7533
# 10 NA 2544 2456
So this divides the dataset into groups using cut(...), then sums the 1s and 0s separately using aggregate(...), then labels the groups. Normally you could use cut(...) without labels=F and get meaningful labels for your groups directly. The problem here is that if those labels end up as plain character strings, aggregate(...) will sort them alphabetically, which is not what you want.
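As a possible alternative, the same counts can probably be obtained without integer codes by keeping the output of cut(...) as a factor and cross-tabulating it with table(...). The sketch below is only an illustration on a small made-up data set (1000 rows, 5% NA); the names grp and counts are not from the question, and include.lowest and dig.lab are choices made here for readability.

```r
# Hypothetical toy data standing in for the real data set
set.seed(1)
df <- data.frame(Y = sample(0:1, 1000, replace = TRUE),
                 TOTA = runif(1000, 0, 18938000))
df$TOTA[sample(1000, 50)] <- NA   # 5% missing values

breaks <- c(0, 10^(2:6), 5938000, 10938000, 15938000, 18938000)

# cut() returns a factor whose levels follow the order of the breaks;
# include.lowest = TRUE puts a value of exactly 0 into the first bin,
# dig.lab = 8 avoids scientific notation in the generated labels
grp <- cut(df$TOTA, breaks = breaks, include.lowest = TRUE, dig.lab = 8)
grp <- addNA(grp)   # turn NA into an explicit last level

# Cross-tabulate group against Y; rows stay in factor-level order
counts <- table(Group = grp, Y = df$Y)

result <- data.frame(Group   = levels(grp),
                     Active  = as.vector(counts[, "1"]),
                     Passive = as.vector(counts[, "0"]))
result$Total <- result$Active + result$Passive
result
```

Because the grouping variable stays a factor, the rows come out in break order rather than alphabetical order, and addNA(...) gives the missing values their own row at the end.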
Also, note that in your question you have a range 1000000 - 1000000 (eg 1MM to 1MM). I assumed this is supposed to be 1000000 - 5938000.
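The question also mentions deciles. If equal-sized groups are wanted instead of fixed ranges, one way that might work is to take the break points from quantile(...) and pass them to cut(...). This is a sketch on made-up, skewed toy data; dec_breaks, decile and deciles are illustrative names, and it assumes the decile boundaries are all distinct (cut(...) stops with an error if breaks repeat, which can happen when many values are tied).

```r
# Hypothetical toy data: skewed values, 5% missing
set.seed(1)
df <- data.frame(Y = sample(0:1, 1000, replace = TRUE),
                 TOTA = rlnorm(1000, meanlog = 8, sdlog = 2))
df$TOTA[sample(1000, 50)] <- NA

# Decile boundaries: the 0%, 10%, ..., 100% quantiles of the non-missing values
dec_breaks <- quantile(df$TOTA, probs = seq(0, 1, by = 0.1), na.rm = TRUE)

# Assign each row to a decile; include.lowest = TRUE keeps the minimum in bin 1
df$decile <- cut(df$TOTA, breaks = dec_breaks, labels = 1:10,
                 include.lowest = TRUE)

# Count active (Y == 1) and passive (Y == 0) rows per decile;
# rows with a missing TOTA have an NA decile and are left out here
active  <- tapply(df$Y == 1, df$decile, sum)
passive <- tapply(df$Y == 0, df$decile, sum)

deciles <- data.frame(Decile  = 1:10,
                      Active  = as.vector(active),
                      Passive = as.vector(passive))
deciles$Total <- deciles$Active + deciles$Passive
deciles
```

For the plot, passing t(as.matrix(deciles[, c("Active", "Passive")])) to barplot(...) with beside = TRUE should draw the two counts side by side for each decile; strictly speaking that is a bar chart of the table rather than a histogram.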