Please consider the following data frame:
#build sample data.frame
theData <- data.frame(surname = c("Smith","Parker", "Allen", "McGraw", "Parker", "Smith", "Smith"),
FamilySize = c(3, 2, 1, 1, 2, 3, 3))
First I need to verify that the number of persons sharing the same surname corresponds to the size of the family they belong to. For example, there are 3 persons with surname = "Smith", and the FamilySize variable for each of them is 3. If this condition is satisfied, the size of the family is prepended to the surname (e.g. "3Smith"); if not, the result should be the word "small".
For this purpose I have written this function:
# function
familyKount <- function(df, lastName, famSize){
  # calculate number of persons sharing same surname
  nPersons <- dim(subset(df, surname == lastName))[1]
  # number of persons agrees with family size
  if(nPersons == famSize) {
    idFam <- paste(as.character(famSize), lastName, sep="")
  } else { # number of persons does not agree with family size
    idFam <- "small"
  }
  idFam
}
So if I invoke this function as follows
familyKount(theData, theData$surname[1], theData$FamilySize[1])
I obtain the correct answer: "3Smith".
However, what I would like is to apply this function to the whole data frame, without having to specify an index for surname and FamilySize (I don't want to use a for loop). I have tried variations of the apply family of functions, but I haven't figured out how to pass a whole data frame, as well as specific columns of it, as arguments of a function in this kind of situation.
Cheers
There are many solutions to this. You could for instance use table:
table(theData$surname)
## Allen McGraw Parker Smith
## 1 1 2 3
Or with dplyr:
library(dplyr)
group_by(theData, surname) %>%
  summarize(SizeCalculated = n())
## Source: local data frame [4 x 2]
##
## surname SizeCalculated
## (fctr) (int)
## 1 Allen 1
## 2 McGraw 1
## 3 Parker 2
## 4 Smith 3
Or with aggregate():
aggregate(theData, list(theData$surname), length)
## Group.1 surname FamilySize
## 1 Allen 1 1
## 2 McGraw 1 1
## 3 Parker 2 2
## 4 Smith 3 3
You can also find a solution with sapply() that is probably similar to what you intended:
surnames <- unique(theData$surname)
counts <- sapply(surnames, function(s) sum(theData$surname == s))
data.frame(surnames, counts)
## surnames counts
## 1 Smith 3
## 2 Parker 2
## 3 Allen 1
## 4 McGraw 1
The idea is to apply over the surnames.
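The same apply-style approach also works row by row with the question's own function. A minimal sketch using mapply(), which walks over surname and FamilySize in parallel while the whole data frame is passed once through MoreArgs. The data and familyKount are repeated from the question so the block runs on its own; stringsAsFactors = FALSE and the new idFam column are my additions, not part of the original post.

```r
# Sample data from the question; stringsAsFactors = FALSE is an assumption
# added here so that surnames are plain character on any R version.
theData <- data.frame(
  surname = c("Smith", "Parker", "Allen", "McGraw", "Parker", "Smith", "Smith"),
  FamilySize = c(3, 2, 1, 1, 2, 3, 3),
  stringsAsFactors = FALSE
)

# Function from the question, unchanged in behaviour.
familyKount <- function(df, lastName, famSize) {
  nPersons <- dim(subset(df, surname == lastName))[1]
  if (nPersons == famSize) {
    idFam <- paste(as.character(famSize), lastName, sep = "")
  } else {
    idFam <- "small"
  }
  idFam
}

# mapply() calls familyKount once per row: lastName and famSize vary,
# while df is held constant through MoreArgs.
theData$idFam <- mapply(
  familyKount,
  lastName = theData$surname,
  famSize = theData$FamilySize,
  MoreArgs = list(df = theData),
  USE.NAMES = FALSE
)

print(theData)
```

This keeps the function as written and avoids an explicit for loop, although it still rescans the data frame once per row, so it may be slow on large inputs.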
All these solutions can be extended to include the check of FamilySize from theData. For example, the aggregate() solution:
tab <- aggregate(theData, list(theData$surname), length)
tab$size_check <- tab$surname == tab$FamilySize
tab
## Group.1 surname FamilySize size_check
## 1 Allen 1 1 TRUE
## 2 McGraw 1 1 TRUE
## 3 Parker 2 2 TRUE
## 4 Smith 3 3 TRUE
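One caveat: in the aggregate() call above, length is applied to both columns, so tab$surname and tab$FamilySize are both group counts, and size_check compares a count with itself rather than with the recorded family size. A possible vectorised sketch that compares each row's recorded FamilySize with the actual count and builds the label asked for in the question is shown here. The helper name labelFamilies and the extra "Jones" row are hypothetical, added only for illustration.

```r
# Sample data from the question (stringsAsFactors = FALSE is an assumption
# so that surnames are character on any R version).
theData <- data.frame(
  surname = c("Smith", "Parker", "Allen", "McGraw", "Parker", "Smith", "Smith"),
  FamilySize = c(3, 2, 1, 1, 2, 3, 3),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns one label per row of df.
labelFamilies <- function(df) {
  # ave() returns, for every row, the number of rows sharing its surname.
  nPersons <- ave(df$FamilySize, df$surname, FUN = length)
  # Prefix the surname with the family size where the count agrees,
  # otherwise fall back to "small".
  ifelse(nPersons == df$FamilySize,
         paste0(df$FamilySize, df$surname),
         "small")
}

theData$idFam <- labelFamilies(theData)

# Hypothetical extra row whose FamilySize (4) disagrees with its count (1).
withJones <- rbind(
  theData[, c("surname", "FamilySize")],
  data.frame(surname = "Jones", FamilySize = 4, stringsAsFactors = FALSE)
)
withJones$idFam <- labelFamilies(withJones)

print(withJones)
```

Because ave() and ifelse() work on whole columns, no per-row function call or index is needed.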