Suppose I have a dataframe:
sick <- c("daa12", "daa13", "daa14", "daa15", "daa16", "daa17")
code <- c("heart", "heart", "lung", "lung", "cancer", "cancer")
sick_code <- data.frame(sick, code)
And another:
pid <- abs(round(rnorm(6)*1000,0))
sick <- c("-" , "-", "-", "-", "daa16", "SO")
p_sick <- data.frame(pid, sick)
Now i would like to add a new varialbe to p_sick, that "translates" p_sick$sick to sick_code$code. The variable in p_sick$sick is a string which may or may not be p_sick$sick in this case NA should be returned.
Now I could write for loop with a simple ifelse statement. But the data I have is 150million rows long, and the translate table is 15.000 long.
I have googled that this is the equalivalent of a "proc format" in SaS (but I do not have acces to SaS, nor do I have any idea how it works).
Perhaps some variant of merge in plyr, or an apply function?
EDIT: I have accepted both answer, since they work. I will try and look into the difference (in speed) between the two. Since merge is a built in function I am guessing it does lots of checking.
EDIT2: To people getting here by Google; merge has and sort = FALSE which will speed things up. Note that the order is not preserved in any way.
data.table
will be suitable in your example:
library(data.table)
setkey(setDT(p_sick),sick)
p_sick[setDT(sick_code),code := i.code][]
pid sick code
1: 3137 - NA
2: 755 - NA
3: 1327 - NA
4: 929 - NA
5: 939 daa16 cancer
6: 906 SO NA
Please see here for detail explanation.
You could use merge
with all.x = TRUE
(to keep values from p_sick
with no match in sick_code
:
merge(p_sick, sick_code, all.x = TRUE)
An equivalent is using left_join
from dplyr:
library(dplyr)
left_join(p_sick, sick_code)
# pid sick code
# 1 212 - <NA>
# 2 2366 - <NA>
# 3 325 - <NA>
# 4 269 - <NA>
# 5 501 daa16 cancer
# 6 1352 SO <NA>
Note that each of these solutions works only because the name sick
is shared between the two data frames. Suppose they had different names- say the column was called sickness
in sick_code
. You could accommodate this with, respectively:
merge(p_sick, sick_code, by.x = "sick", by.y = "sickness", all.x = TRUE)
# or
left_join(p_sick, sick_code, c(sick = "sickness"))
A simple named vector will also work. The named vector can act as a lookup. So instead of defining sick and code as a data frame, define it as a named vector and use it as a decode. Like this:
# Set up named vector
sick_decode <- c("heart", "heart", "lung", "lung", "cancer", "cancer")
names(sick_decode) <- c("daa12", "daa13", "daa14", "daa15", "daa16", "daa17")
# Prepare data
pid <- abs(round(rnorm(6)*1000,0))
sick <- c("-" , "-", "-", "-", "daa16", "SO")
p_sick <- data.frame(pid, sick)
# Create new variable using decode
p_sick$sick_decode <- sick_decode[p_sick$sick]
# Results
#> pid sick sick_decode
#> 1 511 - <NA>
#> 2 1619 - <NA>
#> 3 394 - <NA>
#> 4 641 - <NA>
#> 5 53 daa16 cancer
#> 6 244 SO <NA>
I suspect this method will also be fast, but have not benchmarked it.
Also, there is now an R package specifically for replicating SAS format functionality in R. It is called fmtr .
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.