In Stata I am able to use the codebookout
command to create an Excel workbook that saves name, label, and storage type of all the variables in the existing dataset with their corresponding values and value labels.
I would like to find an equivalent function in R. So far, I've encountered the memisc
library which has a function called codebook
, but it does not do the same thing as in Stata.
For example, In Stata, the output of the codebook would look like this...(see below - this is what I want)
Variable Name Variable Label Answer Label Answer Code Variable Type
hhid hhid Open ended String
inter_month inter_month Open ended long
year year Open ended long
org_unit org_unit long
Balaka 1
Blantyre 2
Chikwawa 3
Chiradzulu 4
ie each column in the data frame is evaluated to produce values for 5 different columns:
Here is my attempt:
CreateCodebook <- function(dF){
numbercols <- length(colnames(dF))
table <- data.frame()
for (i in 1:length(colnames(dF))){
AnswerCode <- if (sapply(dF, is.factor)[i]) 1:nrow(unique(dF[i])) else ""
AnswerLabel <- if (sapply(dF, is.factor)[i]) unique(dF[order(dF[i]),][i]) else "Open ended"
VariableName <- if (length(AnswerCode) - 1 > 1) c(colnames(dF)[i],
rep("",length(AnswerCode) - 1)) else colnames(dF)[i]
VariableLabel <- if (length(AnswerCode) - 1 > 1) c(colnames(dF)[i],
rep("",length(AnswerCode) - 1)) else colnames(dF)[i]
VariableType <- if (length(AnswerCode) - 1 > 1) c(sapply(dF, class)[i],
rep("",length(AnswerCode) - 1)) else sapply(dF, class)[i]
df = data.frame(VariableName, VariableLabel, AnswerLabel, AnswerCode, VariableType)
names(df) <- c("Variable Name", "Variable Label", "Variable Type", "Answer Code", "Answer Label")
table <- rbind(table, df)
}
return(table)
}
Unfortunately, I am getting the following warning message:
Warning messages:
1: In `[<-.factor`(`*tmp*`, ri, value = 1:3) :
invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, ri, value = 1:2) :
invalid factor level, NA generated
The output I produce results in the Answer Code label getting messed up:
Variable Name Variable Label Variable Type Answer Code Answer Label
hhid hhid hhid Open ended character
month month month Open ended integer
year year year Open ended integer
org_unit org_unit org_unit Open ended character
v000 v000 v000 Open ended character
v001 v001 v001 Open ended integer
v002 v002 v002 Open ended integer
v003 v003 v003 Open ended integer
v005 v005 v005 Open ended integer
v006 v006 v006 Open ended integer
v007 v007 v007 Open ended integer
v021 v021 v021 Open ended numeric
2285 v024 v024 central <NA> factor
1 north <NA>
7119 south <NA>
11 v025 v025 rural <NA> factor
1048 v025 v025 urban <NA> factor
district_name district_name district_name Open ended character
coords_x1 coords_x1 coords_x1 Open ended numeric
coords_x2 coords_x2 coords_x2 Open ended numeric
itn_color itn_color itn_color Open ended numeric
piped piped piped Open ended numeric
sanit sanit sanit Open ended numeric
sanit_cd sanit_cd sanit_cd Open ended numeric
water water water Open ended numeric
I decided to take a crack at this for my own amusement. I used the built-in Titanic
data set. I had an issue with one of your definitions, though: you say "If there are no unique values, it is considered open ended". But every variable of length >0 has some unique values: did you mean "if every value is unique"? Even this definition doesn't necessarily work as expected: in the Titanic
data set, the responses are integer, and there happen to be only 22 unique values out of 32 total values. I didn't think that one would really want this to be enumerated, so I tested for type of factor
instead (but you could substitute the length(u)==length(x)
line below if you really want).
## utility function: pad vector with blanks to specified length
pad <- function(x,n,p="") {
return(c(x,rep(p,n-length(x))))
}
## process a single column
proc_col <- function(x,nm) {
u <- unique(x)
## if (length(u)==length(x)) {
if (!is.factor(x)) {
n <- 1
u <- "open ended"
cc <- ""
} else {
cc <- as.numeric(u)
n <- length(u)
}
dd <- data.frame(`Variable Name`=pad(nm,n),
`Variable Label`=pad(nm,n),
`Answer Label`=u,
`Answer Code`=cc,
`Variable Type`=pad(class(x),n),
stringsAsFactors=FALSE)
return(dd)
}
## process all columns
proc_df <- function(x) {
L <- Map(proc_col,x,names(x))
dd <- do.call(rbind,L)
rownames(dd) <- NULL
return(dd)
}
Example:
xx <- as.data.frame.table(Titanic)
proc_df(xx)
## Variable.Name Variable.Label Answer.Label Answer.Code Variable.Type
## 1 Class Class 1st 1 factor
## 2 2nd 2
## 3 3rd 3
## 4 Crew 4
## 5 Sex Sex Male 1 factor
## 6 Female 2
## 7 Age Age Child 1 factor
## 8 Adult 2
## 9 Survived Survived No 1 factor
## 10 Yes 2
## 11 Freq Freq open ended numeric
I didn't leave blank spaces before the lists of code values etc., but you can make those adjustments yourself ...
Here is my crack at a solution:
CreateCodebook <- function(dF){
numbercols <- length(colnames(dF))
table <- data.frame()
for (i in 1:length(colnames(dF))){
AnswerCode <- if (sapply(dF, is.factor)[i]) 1:nrow(unique(dF[i])) else ""
AnswerLabel <- if (sapply(dF, is.factor)[i]) unique(dF[order(dF[i]),][i]) else "Open ended"
VariableName <- if (length(AnswerCode) > 1) c(colnames(dF)[i],
rep("",length(AnswerCode) - 1)) else colnames(dF)[i]
VariableLabel <- if (length(AnswerCode) > 1) c(colnames(dF)[i],
rep("",length(AnswerCode) - 1)) else colnames(dF)[i]
VariableType <- if (length(AnswerCode) > 1) c(sapply(dF, class)[i],
rep("",length(AnswerCode) - 1)) else sapply(dF, class)[i]
df = data.frame(VariableName, VariableLabel, AnswerLabel, AnswerCode, VariableType, stringsAsFactors = FALSE)
names(df) <- c("Variable Name", "Variable Label", "Variable Type", "Answer Code", "Answer Label")
table <- rbind(table, df)
}
rownames(table) <- 1:nrow(table)
return(table)
}
Output:
Variable Name Variable Label Variable Type Answer Code Answer Label
1 brid brid Open ended character
2 month month Open ended integer
3 year year Open ended integer
4 org_unit org_unit Open ended character
5 v000 v000 Open ended character
6 v001 v001 Open ended integer
7 v002 v002 Open ended integer
8 v003 v003 Open ended integer
9 v005 v005 Open ended integer
10 v006 v006 Open ended integer
11 v007 v007 Open ended integer
12 v021 v021 Open ended numeric
13 v024 v024 central 1 factor
14 north 2
15 south 3
16 v025 v025 rural 1 factor
17 urban 2
18 bidx bidx Open ended integer
19 district_name district_name Open ended character
20 coords_x1 coords_x1 Open ended numeric
21 coords_x2 coords_x2 Open ended numeric
22 anc4 anc4 Open ended numeric
23 antimal_48 antimal_48 Open ended numeric
24 carep carep Open ended numeric
25 csec csec Open ended numeric
26 dptv dptv Open ended numeric
27 ebreast ebreast Open ended numeric
28 fans_48 fans_48 Open ended numeric
29 ideliv ideliv Open ended numeric
30 iptp iptp Open ended numeric
31 iron90 iron90 Open ended numeric
32 measlesv measlesv Open ended numeric
33 ors ors Open ended numeric
34 ort ort Open ended numeric
35 pncwm pncwm Open ended numeric
36 sstools sstools Open ended numeric
37 tt tt Open ended numeric
38 vita vita Open ended numeric
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.