Passing a function depending on a data frame subset as well as data frame columns to sapply in R

Question

Please consider the following data frame:

#build sample data.frame
theData <- data.frame(surname = c("Smith","Parker", "Allen", "McGraw", "Parker", "Smith", "Smith"),
                     FamilySize = c(3, 2, 1, 1, 2, 3, 3))

First I need to verify that the number of persons sharing the same surname corresponds to the size of the family they belong to. For example, there are 3 persons with surname = "Smith", and the FamilySize variable for each of them is 3. If this condition is satisfied, the size of the family is prepended to the surname (e.g. "3Smith"); if not, the result should be the word "small".

For this purpose I have written this function:

# function
familyKount <- function(df, lastName, famSize){
    # calculate number of persons sharing same surname
    nPersons <- dim(subset(df, surname == lastName))[1]

    # number of persons agrees with family size
    if(nPersons == famSize) {
            idFam <- paste(as.character(famSize), lastName, sep="")
    } else {                # number of persons does not agree with family size
            idFam <- "small"
    }
    idFam
}

So if I invoke this function as follows

familyKount(theData, theData$surname[1], theData$FamilySize[1])

I obtain the correct answer: "3Smith".

However, what I would like is to apply this function to the whole data frame, without having to specify an index for surname and FamilySize (I don't want to use a for loop). I have tried variations of the apply family of functions but I haven't figured out how to pass a whole data frame as well as specific columns of it as arguments of a function in this kind of situation.

Cheers

Answer 1

There are many solutions to this. You could for instance use table:

table(theData$surname)

##  Allen McGraw Parker  Smith 
##      1      1      2      3

Or with dplyr:

library(dplyr)
group_by(theData, surname) %>%
  summarize(SizeCalculated = n())
## Source: local data frame [4 x 2]
## 
##   surname SizeCalculated
##    (fctr)          (int)
## 1   Allen              1
## 2  McGraw              1
## 3  Parker              2
## 4   Smith              3

Or with aggregate():

aggregate(theData, list(theData$surname), length)
##   Group.1 surname FamilySize
## 1   Allen       1          1
## 2  McGraw       1          1
## 3  Parker       2          2
## 4   Smith       3          3

You can also find a solution with sapply() that is probably similar to what you intended:

surnames <- unique(theData$surname)
counts <- sapply(surnames, function(s) sum(theData$surname == s))
data.frame(surnames, counts)
##   surnames counts
## 1    Smith      3
## 2   Parker      2
## 3    Allen      1
## 4   McGraw      1

The idea is to apply over the surnames.
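If the goal is to call the original familyKount() once per row, one option is mapply(), which walks over several vectors in parallel and accepts fixed arguments through MoreArgs. The following is a minimal sketch, assuming the function from the question is kept unchanged; stringsAsFactors = FALSE is set here so the surnames are plain character strings regardless of the R version.

```r
# Sample data from the question (surnames kept as character strings)
theData <- data.frame(
  surname = c("Smith", "Parker", "Allen", "McGraw", "Parker", "Smith", "Smith"),
  FamilySize = c(3, 2, 1, 1, 2, 3, 3),
  stringsAsFactors = FALSE
)

# Function from the question, unchanged in behaviour
familyKount <- function(df, lastName, famSize) {
  nPersons <- dim(subset(df, surname == lastName))[1]
  if (nPersons == famSize) {
    paste(as.character(famSize), lastName, sep = "")
  } else {
    "small"
  }
}

# lastName and famSize vary row by row; the whole data frame is passed
# as a constant through MoreArgs
idFam <- mapply(familyKount,
                lastName = theData$surname,
                famSize = theData$FamilySize,
                MoreArgs = list(df = theData),
                USE.NAMES = FALSE)
idFam
## [1] "3Smith"  "2Parker" "1Allen"  "1McGraw" "2Parker" "3Smith"  "3Smith"
```

This still subsets the data frame once per row, so it is convenient rather than fast; for large data a grouped count is likely the better choice.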

All these solutions can be extended to include the check of FamilySize from theData. For example, the aggregate() solution, here comparing the recorded FamilySize values of each surname with the number of rows for that surname:

tab <- aggregate(FamilySize ~ surname, data = theData,
                 FUN = function(x) all(x == length(x)))
names(tab)[2] <- "size_check"
tab
##   surname size_check
## 1   Allen       TRUE
## 2  McGraw       TRUE
## 3  Parker       TRUE
## 4   Smith       TRUE
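To get the per-row label the question asks for ("3Smith" or "small") without an explicit loop, a vectorised variant is possible. This is only a sketch: it uses base R's ave() to attach the per-surname row count to every row and ifelse() to build the label. The name famData and the extra "Jones" row are hypothetical additions, included only so the "small" branch is exercised.

```r
# Sample data from the question, plus one hypothetical mismatching row:
# "Jones" appears once but records a family size of 4
famData <- data.frame(
  surname = c("Smith", "Parker", "Allen", "McGraw",
              "Parker", "Smith", "Smith", "Jones"),
  FamilySize = c(3, 2, 1, 1, 2, 3, 3, 4),
  stringsAsFactors = FALSE
)

# Number of rows sharing each row's surname, aligned with the rows
nPersons <- ave(seq_along(famData$surname), famData$surname, FUN = length)

# Prepend the family size where the count agrees, otherwise "small"
famData$idFam <- ifelse(nPersons == famData$FamilySize,
                        paste0(famData$FamilySize, famData$surname),
                        "small")
famData$idFam
## [1] "3Smith"  "2Parker" "1Allen"  "1McGraw" "2Parker" "3Smith"  "3Smith"  "small"
```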
