
How to generate grouped results in R?

Here is my data:

> str(myData)
'data.frame':   500 obs. of  12 variables:
 $ PassengerId: int  1 2 5 6 7 8 9 10 11 12 ...
 $ Survived   : int  0 1 0 0 0 0 1 1 1 1 ...
 $ Pclass     : int  3 1 3 3 1 3 3 2 3 1 ...
 $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 16 559 520 629 417 581 732 96 ...
 $ Sex        : Factor w/ 2 levels "female","male": 2 1 2 2 2 2 1 1 1 1 ...
 $ Age        : num  22 38 35 NA 54 2 27 14 4 58 ...
 $ SibSp      : int  1 1 0 0 0 3 0 1 1 0 ...
 $ Parch      : int  0 0 0 0 0 1 2 0 1 0 ...
 $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 473 276 86 396 345 133 617 39 ...
 $ Fare       : num  7.25 71.28 8.05 8.46 51.86 ...
 $ Cabin      : Factor w/ 147 levels "A10","A14","A16",..: NA 82 NA NA 130 NA NA NA 146 50 ...
 $ Embarked   : Factor w/ 3 levels "C","Q","S": 3 1 3 2 3 3 3 1 3 3 ...

I have to generate two results:

1. A table of passenger counts grouped by title and Pclass (image 1, not reproduced here).

2. A table of missing-age counts grouped by title and Pclass (image 2, not reproduced here).

But when I used what I know, both came out as below:

> myData$Name = as.character(myData$Name)
> table_words = table(unlist(strsplit(myData$Name, "\\s+")))
> sort(table_words [grep('\\.',names(table_words))], decreasing=TRUE)

      Mr.     Miss.      Mrs.   Master.       Dr.      Rev.      Col.     Capt. Countess.      Don. 
      289        99        76        20         5         3         2         1         1         1 
       L.     Mlle.      Mme.      Sir. 
        1         1         1         1 
> library(stringr)
> tb = cbind(myData$Age, str_match(myData$Name, "[a-zA-Z]+\\."))
> table(tb[is.na(tb[,1]),2])

    Dr. Master.   Miss.     Mr.    Mrs. 
      1       3      18      62       7 

Basically, I need tables that do not show one overall total per title, as I did above, but break each title down into three rows, one per Pclass value (1, 2, or 3 in 'myData'), so that the three rows still add up to the overall total.

For example, the result in image 1 would show that Capt. occurs only once, under Pclass 1.

How should I split the totals by Pclass 1, 2, and 3?

It is hard to tell without the data (though I think it comes from the Titanic dataset on Kaggle).

I think the first thing to do is to create a new Title factor, since you want to base your analysis on it. Here dat stands for your data frame (myData in the question). I'd do something like:

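# Extract title from name and make it a factor.
# Take the first word followed by a period, so that a later initial
# such as "L." is not mistaken for the title.
dat$Title <- sub("^.*?([A-Za-z]+)\\..*$", "\\1", as.character(dat$Name), perl = TRUE)
dat$Title <- factor(dat$Title)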

You'll need to check that it works with your data.
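
One way to run that check is to extract titles from a handful of names and look at the frequency table. The sketch below is only an illustration: the names are made up, and the pattern (take the first word followed by a period) is one possible choice rather than the only correct one. A name containing an initial such as "L." is worth including, because a pattern that matches too greedily can pick up the initial instead of the title.

```r
# Made-up names in the "Surname, Title. Given names" layout (not real rows)
nms <- c("Doe, Mr. John",
         "Roe, Mrs. Richard (Jane L. Poe)",
         "Moe, the Countess. of (Ann Lee)",
         "Poe, Master. Tim")

# Lazy prefix, then the first run of letters that is followed by a period
title <- sub("^.*?([A-Za-z]+)\\..*$", "\\1", nms, perl = TRUE)

# Frequency of each extracted title
print(table(title))

# sub() returns the input unchanged when nothing matches,
# so these are the names for which no title was found
unmatched <- nms[title == nms]
print(unmatched)
```

If unmatched is not empty, or the table shows an entry that is not a title, adjust the pattern before building the factor.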

Once you have the Title factor, you can use ddply from the plyr package to build the first table (counts grouped by Title and Pclass of each passenger):

library(plyr)
# Number of occurrences
classTitle <- ddply(dat, c('Pclass', 'Title'), summarise,
                    count=length(Name))
# Convert to wide format
classTitle <- reshape(classTitle, idvar = "Title", timevar = "Pclass",
                      direction = "wide")
# Fill NA's with 0
classTitle[is.na(classTitle)] <- 0
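
If plyr is not available, base R's table() gives a comparable Title by Pclass layout in one step. This is a sketch of an alternative, not the ddply approach itself; the toy data frame is made up and assumes Title has already been extracted.

```r
# Made-up rows for illustration; Title is assumed to exist already
toy <- data.frame(
  Title  = c("Mr", "Mr", "Mrs", "Miss", "Mr", "Capt", "Mr"),
  Pclass = c(3L, 1L, 1L, 3L, 3L, 1L, 2L)
)

# Rows are titles, columns are passenger classes;
# combinations that never occur are counted as 0
classTitle <- table(toy$Title, toy$Pclass, dnn = c("Title", "Pclass"))
print(classTitle)

# Same table with row and column totals appended
print(addmargins(classTitle))
```

The row totals from addmargins() should match the overall per-title counts, which is a quick way to confirm that nothing was lost in the breakdown.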

The second requirement (a table of missing-age counts grouped by Title and Pclass) works almost the same way:

# Number of NA in Age
countNA <- ddply(dat, c('Pclass', 'Title'), summarise,
                 na = sum(is.na(Age)))
# Convert to wide format
countNA <- reshape(countNA, idvar = "Title", timevar = "Pclass",
                   direction = "wide")
# Fill NA's with 0
countNA[is.na(countNA)] <- 0
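
The missing-age table can also be sketched in base R with xtabs(), which sums a numeric left-hand side within each Title by Pclass cell. This is an alternative under stated assumptions, not the ddply approach itself; the toy data frame is made up for illustration.

```r
# Made-up rows for illustration; Title is assumed to exist already
toy <- data.frame(
  Title  = c("Mr", "Mr", "Mrs", "Miss", "Mr", "Dr"),
  Pclass = c(3L, 3L, 1L, 2L, 1L, 1L),
  Age    = c(NA, 22, NA, NA, 40, 50)
)

# is.na(Age) is 1 for a missing age and 0 otherwise, so the per-cell
# sum is the number of passengers with a missing age
countNA <- xtabs(as.integer(is.na(Age)) ~ Title + Pclass, data = toy)
print(countNA)
```

Titles with no missing ages still appear, with a row of zeros, so the table has the same shape as the count table.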


 