简体   繁体   中英

R: Is there an equivalent to Stata's codebookout command?

In Stata I am able to use the codebookout command to create an Excel workbook that saves name, label, and storage type of all the variables in the existing dataset with their corresponding values and value labels.

I would like to find an equivalent function in R. So far, I've encountered the memisc library which has a function called codebook , but it does not do the same thing as in Stata.

For example, In Stata, the output of the codebook would look like this...(see below - this is what I want)

Variable Name   Variable Label    Answer Label  Answer Code    Variable Type
    hhid               hhid           Open ended                    String
    inter_month        inter_month    Open ended                    long
    year               year           Open ended                    long
    org_unit           org_unit                                     long
                                      Balaka         1  
                                      Blantyre       2  
                                      Chikwawa       3  
                                      Chiradzulu     4  

ie each column in the data frame is evaluated to produce values for 5 different columns:

  • Variable Name which is the name of the column
  • Variable Label which is the name of the column
  • Answer Label which is the unique values in the column. If there are no unique values, it is considered open ended
  • Answer Code which is the numerical assignment to each category in the Answer Label. Blank if the Answer Label is not categorical.
  • Variable Type: int, str, long (date)...

Here is my attempt:

CreateCodebook <- function(dF){
  numbercols <- length(colnames(dF))

  table <- data.frame()

  for (i in 1:length(colnames(dF))){
    AnswerCode <- if (sapply(dF, is.factor)[i]) 1:nrow(unique(dF[i])) else ""
    AnswerLabel <- if (sapply(dF, is.factor)[i]) unique(dF[order(dF[i]),][i]) else "Open ended"
    VariableName <- if (length(AnswerCode) - 1 > 1) c(colnames(dF)[i], 
                                                  rep("",length(AnswerCode) - 1)) else colnames(dF)[i]
    VariableLabel <- if (length(AnswerCode) - 1 > 1) c(colnames(dF)[i], 
                                                   rep("",length(AnswerCode) - 1)) else colnames(dF)[i]
    VariableType <- if (length(AnswerCode) - 1 > 1) c(sapply(dF, class)[i], 
                                                  rep("",length(AnswerCode) - 1)) else sapply(dF, class)[i]

    df = data.frame(VariableName, VariableLabel, AnswerLabel, AnswerCode, VariableType)
    names(df) <- c("Variable Name", "Variable Label", "Variable Type", "Answer Code", "Answer Label")
    table <- rbind(table, df)

  }
  return(table)
}

Unfortunately, I am getting the following warning message:

Warning messages:
1: In `[<-.factor`(`*tmp*`, ri, value = 1:3) :
  invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, ri, value = 1:2) :
  invalid factor level, NA generated

The output I produce results in the Answer Code label getting messed up:

              Variable Name Variable Label Variable Type Answer Code Answer Label
hhid                   hhid           hhid    Open ended                character
month                 month          month    Open ended                  integer
year                   year           year    Open ended                  integer
org_unit           org_unit       org_unit    Open ended                character
v000                   v000           v000    Open ended                character
v001                   v001           v001    Open ended                  integer
v002                   v002           v002    Open ended                  integer
v003                   v003           v003    Open ended                  integer
v005                   v005           v005    Open ended                  integer
v006                   v006           v006    Open ended                  integer
v007                   v007           v007    Open ended                  integer
v021                   v021           v021    Open ended                  numeric
2285                   v024           v024       central        <NA>       factor
1                                                  north        <NA>             
7119                                               south        <NA>             
11                     v025           v025         rural        <NA>       factor
1048                   v025           v025         urban        <NA>       factor
district_name district_name  district_name    Open ended                character
coords_x1         coords_x1      coords_x1    Open ended                  numeric
coords_x2         coords_x2      coords_x2    Open ended                  numeric
itn_color         itn_color      itn_color    Open ended                  numeric
piped                 piped          piped    Open ended                  numeric
sanit                 sanit          sanit    Open ended                  numeric
sanit_cd           sanit_cd       sanit_cd    Open ended                  numeric
water                 water          water    Open ended                  numeric

I decided to take a crack at this for my own amusement. I used the built-in Titanic data set. I had an issue with one of your definitions, though: you say "If there are no unique values, it is considered open ended". But every variable of length >0 has some unique values: did you mean "if every value is unique"? Even this definition doesn't necessarily work as expected: in the Titanic data set, the responses are integer, and there happen to be only 22 unique values out of 32 total values. I didn't think that one would really want this to be enumerated, so I tested for type of factor instead (but you could substitute the length(u)==length(x) line below if you really want).

## utility function: pad vector with blanks to specified length
pad <- function(x,n,p="") {
    return(c(x,rep(p,n-length(x))))
}
## process a single column
proc_col <- function(x,nm) {
    u <- unique(x)
    ## if (length(u)==length(x)) {
    if (!is.factor(x)) {
        n <- 1
        u <- "open ended"
        cc <- ""
    } else {
        cc <- as.numeric(u)
        n <- length(u)
    }
    dd <- data.frame(`Variable Name`=pad(nm,n),
               `Variable Label`=pad(nm,n),
               `Answer Label`=u,
               `Answer Code`=cc,
               `Variable Type`=pad(class(x),n),
               stringsAsFactors=FALSE)
    return(dd)
}
## process all columns
proc_df <- function(x) {
    L <- Map(proc_col,x,names(x))
    dd <- do.call(rbind,L)
    rownames(dd) <- NULL
    return(dd)
}

Example:

xx <- as.data.frame.table(Titanic)
proc_df(xx)

##    Variable.Name Variable.Label Answer.Label Answer.Code Variable.Type
## 1          Class          Class          1st           1       factor
## 2                                        2nd           2              
## 3                                        3rd           3              
## 4                                       Crew           4              
## 5            Sex            Sex         Male           1       factor
## 6                                     Female           2              
## 7            Age            Age        Child           1       factor
## 8                                      Adult           2              
## 9       Survived       Survived           No           1       factor
## 10                                       Yes           2              
## 11          Freq           Freq   open ended                  numeric

I didn't leave blank spaces before the lists of code values etc., but you can make those adjustments yourself ...

Here is my crack at a solution:

CreateCodebook <- function(dF){
  numbercols <- length(colnames(dF))

  table <- data.frame()

  for (i in 1:length(colnames(dF))){
    AnswerCode <- if (sapply(dF, is.factor)[i]) 1:nrow(unique(dF[i])) else ""
    AnswerLabel <- if (sapply(dF, is.factor)[i]) unique(dF[order(dF[i]),][i]) else "Open ended"
    VariableName <- if (length(AnswerCode) > 1) c(colnames(dF)[i],
                                                  rep("",length(AnswerCode) - 1)) else colnames(dF)[i]
    VariableLabel <- if (length(AnswerCode) > 1) c(colnames(dF)[i],
                                                   rep("",length(AnswerCode) - 1)) else colnames(dF)[i]
    VariableType <- if (length(AnswerCode) > 1) c(sapply(dF, class)[i],
                                                  rep("",length(AnswerCode) - 1)) else sapply(dF, class)[i]

    df = data.frame(VariableName, VariableLabel, AnswerLabel, AnswerCode, VariableType, stringsAsFactors = FALSE)
    names(df) <- c("Variable Name", "Variable Label", "Variable Type", "Answer Code", "Answer Label")
    table <- rbind(table, df)

  }
  rownames(table) <- 1:nrow(table)
  return(table)
}

Output:

   Variable Name Variable Label Variable Type Answer Code Answer Label
1           brid           brid    Open ended                character
2          month          month    Open ended                  integer
3           year           year    Open ended                  integer
4       org_unit       org_unit    Open ended                character
5           v000           v000    Open ended                character
6           v001           v001    Open ended                  integer
7           v002           v002    Open ended                  integer
8           v003           v003    Open ended                  integer
9           v005           v005    Open ended                  integer
10          v006           v006    Open ended                  integer
11          v007           v007    Open ended                  integer
12          v021           v021    Open ended                  numeric
13          v024           v024       central           1       factor
14                                      north           2             
15                                      south           3             
16          v025           v025         rural           1       factor
17                                      urban           2             
18          bidx           bidx    Open ended                  integer
19 district_name  district_name    Open ended                character
20     coords_x1      coords_x1    Open ended                  numeric
21     coords_x2      coords_x2    Open ended                  numeric
22          anc4           anc4    Open ended                  numeric
23    antimal_48     antimal_48    Open ended                  numeric
24         carep          carep    Open ended                  numeric
25          csec           csec    Open ended                  numeric
26          dptv           dptv    Open ended                  numeric
27       ebreast        ebreast    Open ended                  numeric
28       fans_48        fans_48    Open ended                  numeric
29        ideliv         ideliv    Open ended                  numeric
30          iptp           iptp    Open ended                  numeric
31        iron90         iron90    Open ended                  numeric
32      measlesv       measlesv    Open ended                  numeric
33           ors            ors    Open ended                  numeric
34           ort            ort    Open ended                  numeric
35         pncwm          pncwm    Open ended                  numeric
36       sstools        sstools    Open ended                  numeric
37            tt             tt    Open ended                  numeric
38          vita           vita    Open ended                  numeric

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM