
How can I reference a list based on a variable within a data.frame?

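I have a simple table with emp_id, job_code, and result columns. I want to return the correct payout based on job_code.

I managed this with nested ifelse calls, but what if I have many more job_codes?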

library(dplyr)
set.seed(1)

emp_id   <- round(rnorm(100, 500000, 10000))
job_code <- sample(c('a', 'b', 'c'), 100, replace = TRUE)
result   <- sample(c(1,2,3,4), 100, replace = TRUE)

df <- data.frame(emp_id = emp_id, job_code = job_code, result = result)

job_a <- c(0, 500, 1000, 5000)
job_b <- c(0, 200, 500, 750)
job_c <- c(0, 250, 750, 1000)

# Works but sucky
df %>% mutate(payout = ifelse(job_code == 'a', job_a[result],
  ifelse(job_code == 'b', job_b[result],
    job_c[result])))

dput, if you prefer:

structure(list(emp_id = c(493735, 501836, 491644, 515953, 503295, 
491795, 504874, 507383, 505758, 496946, 515118, 503898, 493788, 
477853, 511249, 499551, 499838, 509438, 508212, 505939, 509190, 
507821, 500746, 480106, 506198, 499439, 498442, 485292, 495218, 
504179, 513587, 498972, 503877, 499462, 486229, 495850, 496057, 
499407, 511000, 507632, 498355, 497466, 506970, 505567, 493112, 
492925, 503646, 507685, 498877, 508811, 503981, 493880, 503411, 
488706, 514330, 519804, 496328, 489559, 505697, 498649, 524016, 
499608, 506897, 500280, 492567, 501888, 481950, 514656, 501533, 
521726, 504755, 492901, 506107, 490659, 487464, 502914, 495567, 
500011, 500743, 494105, 494313, 498648, 511781, 484764, 505939, 
503330, 510631, 496958, 503700, 502671, 494575, 512079, 511604, 
507002, 515868, 505585, 487234, 494267, 487754, 495266), job_code = structure(c(1L, 
1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 3L, 3L, 1L, 3L, 3L, 3L, 1L, 2L, 
3L, 3L, 2L, 1L, 1L, 1L, 2L, 3L, 2L, 1L, 1L, 2L, 3L, 2L, 1L, 2L, 
2L, 2L, 3L, 3L, 2L, 2L, 2L, 1L, 2L, 3L, 1L, 2L, 1L, 2L, 1L, 2L, 
3L, 3L, 3L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 3L, 2L, 1L, 1L, 3L, 3L, 
1L, 1L, 3L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 3L, 1L, 
2L, 3L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 1L, 2L, 3L, 1L, 
1L, 1L, 3L), .Label = c("a", "b", "c"), class = "factor"), result = c(3, 
1, 2, 2, 2, 4, 1, 4, 1, 2, 1, 1, 4, 3, 2, 2, 1, 2, 4, 3, 3, 2, 
2, 4, 4, 4, 4, 4, 2, 4, 4, 2, 2, 4, 1, 2, 2, 1, 3, 4, 4, 1, 3, 
2, 3, 2, 2, 1, 2, 3, 2, 1, 4, 2, 4, 2, 4, 1, 4, 2, 1, 2, 4, 2, 
3, 4, 1, 3, 3, 2, 2, 3, 4, 1, 1, 2, 2, 4, 1, 2, 2, 3, 3, 4, 1, 
1, 4, 4, 1, 4, 1, 1, 4, 3, 1, 2, 3, 2, 2, 1)), .Names = c("emp_id", 
"job_code", "result"), row.names = c(NA, -100L), class = "data.frame")

Ideally I would keep the payouts in a data.frame, but I am not sure how to reference it correctly:

job_payouts <- data.frame(a = job_a, b = job_b, c = job_c)
# Won't work...
df %>% mutate(payout = job_payouts$job_code[result])

This can be done with matrix indexing in base R, a very cool method that is fast and efficient.

# build jobs payout lookup matrix, by hand (see edit below for an extension)
jobs <- rbind(job_a, job_b, job_c)

# add row names to the matrix for convenient reference
rownames(jobs) <- levels(df$job_code)

# get payout using matrix indexing
df$payout <- jobs[cbind(df$job_code, df$result)]

This returns

# print out first 6 observations
head(df)
  emp_id job_code result payout
1 493735        a      3   1000
2 501836        a      1      0
3 491644        b      2    200
4 515953        a      2    500
5 503295        a      2    500
6 491795        b      4    750

# print out jobs matrix for comparison
jobs
  [,1] [,2] [,3] [,4]
a    0  500 1000 5000
b    0  200  500  750
c    0  250  750 1000

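There are a few details worth mentioning.

  1. The data.frame function converts the job_code character vector, so df$job_code is a factor variable whose labels are associated with the natural numbers 1, 2, 3, ... (this conversion is the default only in R versions before 4.0.0; in later versions, pass stringsAsFactors = TRUE or wrap the column in factor()). By default, the levels of a factor are sorted alphabetically by label, so in this example the label "a" corresponds to 1, "b" to 2, and "c" to 3. You can use the levels function to find the ordering of a factor variable and construct the jobs matrix following that template.
  2. The jobs matrix serves as a lookup table. It is constructed so that these integers refer to the row numbers of the jobs matrix. The columns can then be subset just like the original payout vectors.
  3. cbind(df$job_code, df$result) forms an nrow(df) (100) by 2 matrix that is used to look up each employee's payout value in the jobs matrix using matrix indexing. The R intro manual has a nice introductory section on matrix indexing, and further details can be found in help("[").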
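
To make the matrix-indexing step concrete, here is a minimal sketch on a tiny lookup matrix. The names lookup, idx, and picked are made up for illustration and are not part of the answer above.

```r
# Small lookup matrix: one row per job code, one column per result level
lookup <- rbind(a = c(0, 500, 1000, 5000),
                b = c(0, 200, 500, 750))

# Each row of idx is one (row, column) pair to extract
idx <- cbind(c(1, 2, 1),   # row numbers (job codes a, b, a)
             c(3, 4, 1))   # column numbers (results 3, 4, 1)

# Indexing with a two-column matrix returns one value per row of idx
picked <- lookup[idx]
picked
```

Indexing with a two-column matrix pulls out one cell per index row, which differs from lookup[rows, cols], which would return the whole sub-matrix.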

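Edit: building the lookup matrix automatically

In the comments on this answer, the OP expressed concern that building the lookup matrix (which I call "jobs") by hand would be tedious and error-prone. To address these valid concerns, we can use a somewhat obscure argument of the mget function, "ifnotfound". This argument lets us control what mget returns for a list element whose name appears in the vector of names but does not exist in the environment.

In the comments, I suggested using NA to fill in missing levels. We can extend that idea by passing NA as the input to "ifnotfound".

Suppose df$job_code is a factor with levels "a", "aa", "b", and "c". Then we build the lookup matrix as follows: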

# build vector for example, the actual code, using levels(), follows as a comment
job_codes <- c("a", "aa", "b", "c") # job_codes <- levels(df$job_code)

# get ordered list of payouts, with NA for missing payouts
payoutList <- mget(paste0("job_", job_codes), ifnotfound=NA)

This returns a named list.

payoutList
$job_a
[1]    0  500 1000 5000

$job_aa
[1] NA

$job_b
[1]   0 200 500 750

$job_c
[1]    0  250  750 1000

Note that payoutList$job_aa is a single NA. Now build a matrix from this list.

# build lookup matrix using do.call() and rbind()
jobs.lookupMat <- do.call(rbind, payoutList)

jobs.lookupMat
       [,1] [,2] [,3] [,4]
job_a     0  500 1000 5000
job_aa   NA   NA   NA   NA
job_b     0  200  500  750
job_c     0  250  750 1000

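The rows of the matrix are correctly ordered according to the levels of the factor df$job_code, they are conveniently named, and NA fills every row for which there is no payout.

Without changing your data structure, this can be done by defining a function: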
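
As a rough end-to-end sketch of how the automatically built matrix could be used, the block below applies it to a small made-up data frame. The names toy, payout_list, and lookup_mat are hypothetical, and it assumes no variable called job_aa exists in the calling environment.

```r
job_a <- c(0, 500, 1000, 5000)
job_b <- c(0, 200, 500, 750)
job_c <- c(0, 250, 750, 1000)

# Hypothetical toy data: "aa" is a job code with no payout vector defined
toy <- data.frame(
  job_code = factor(c("a", "aa", "c"), levels = c("a", "aa", "b", "c")),
  result   = c(4, 2, 3)
)

# One list element per factor level, NA where no job_* vector exists
payout_list <- mget(paste0("job_", levels(toy$job_code)), ifnotfound = NA)

# rbind recycles the single NA across all four columns
lookup_mat <- do.call(rbind, payout_list)

# Row = integer code of the factor level, column = result
toy$payout <- lookup_mat[cbind(as.integer(toy$job_code), toy$result)]
toy
```

The explicit as.integer() only spells out that the factor's integer codes serve as row numbers; employees whose job code has no payout vector end up with NA.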

job_search <- function(code){
  var_name <- paste0("job_",code)
  if (exists(var_name)){
    return(get(var_name))
  }else{
    return(NA)
  }
}

library(data.table)
setDT(df)
# look up one row at a time; emp_id is not guaranteed to be unique in this data
df[, payout := job_search(job_code)[result], by = seq_len(nrow(df))]
df
        emp_id job_code result payout
  1: 493735        a      3   1000
  2: 501836        a      1      0
  3: 491644        b      2    200
  4: 515953        a      2    500
  5: 503295        a      2    500
  6: 491795        b      4    750
  7: 504874        b      1      0
  8: 507383        a      4   5000
  9: 505758        a      1      0
 10: 496946        c      2    250
 11: 515118        c      1      0
 12: 503898        a      1      0
 ...

However, this is a rather fragile way to store the data, and the paste + get syntax is convoluted.

A better way to store the data is in a lookup table:
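
One way to avoid paste + get, sketched here under the assumption that you are free to keep the payout vectors in a single named list, is to index that list by job code. The names payouts and lookup_payout and the unknown code "zz" are invented for this example.

```r
# Payout vectors kept together in one named list instead of separate job_* variables
payouts <- list(a = c(0, 500, 1000, 5000),
                b = c(0, 200, 500, 750),
                c = c(0, 250, 750, 1000))

job_code <- c("a", "b", "c", "zz")   # "zz" is a hypothetical unknown code
result   <- c(3, 4, 2, 1)

lookup_payout <- function(code, res) {
  p <- payouts[[code]]               # NULL when the code is not in the list
  if (is.null(p)) NA_real_ else p[res]
}

# mapply walks over the codes and results in parallel
payout <- unname(mapply(lookup_payout, job_code, result))
payout
```

If job_code is a factor, convert it with as.character() first so that the list is indexed by name rather than by integer code.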

library(data.table)

job_a <- data.frame(payout = c(0, 500, 1000, 5000))
job_b <- data.frame(payout = c(0, 200, 500, 750))
job_c <- data.frame(payout = c(0, 250, 750, 1000))

job_lookup <- rbindlist( #this is a data.table
  l = list(a = job_a,b = job_b,c = job_c),
  idcol = TRUE
)

# create your result index
job_lookup[, result := 1:.N, by = .id]
job_lookup
    .id payout result
 1:   a      0     1
 2:   a    500     2
 3:   a   1000     3
 4:   a   5000     4
 5:   b      0     1
 6:   b    200     2
 7:   b    500     3
 8:   b    750     4
 9:   c      0     1
10:   c    250     2
11:   c    750     3
12:   c   1000     4

# merge to your initial data.frame
merge(df, job_lookup, by.x = c("job_code","result"), by.y = c(".id","result"), all.x = TRUE)

    job_code result emp_id payout
1          a      1 505758      0
2          a      1 501836      0
3          a      1 503898      0
4          a      1 494575      0
5          a      1 487464      0
6          a      1 503700      0
7          a      1 505939      0
8          a      1 503330      0
9          a      1 512079      0
10         a      1 481950      0
11         a      1 507685      0
12         a      1 490659      0
...        
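
Note that merge() sorts its result by the key columns by default, which is why the rows above are no longer in their original order. If the original order matters, one possible alternative is a match-based lookup in base R. This is only a sketch: lookup and toy are small stand-ins for job_lookup and df, and the pasted keys assume that job codes contain no spaces.

```r
# Long-format lookup table, same layout as job_lookup above
lookup <- data.frame(job_code = rep(c("a", "b", "c"), each = 4),
                     result   = rep(1:4, times = 3),
                     payout   = c(0, 500, 1000, 5000,
                                  0, 200, 500, 750,
                                  0, 250, 750, 1000))

# First three employees from the question's data
toy <- data.frame(emp_id   = c(493735, 501836, 491644),
                  job_code = c("a", "a", "b"),
                  result   = c(3, 1, 2))

# Build a combined key on both sides and match them up
key_toy    <- paste(toy$job_code, toy$result)
key_lookup <- paste(lookup$job_code, lookup$result)

# match() returns positions in lookup, so toy keeps its row order
toy$payout <- lookup$payout[match(key_toy, key_lookup)]
toy
```

Rows with no match in the lookup table get NA, which mirrors the effect of all.x = TRUE in the merge call.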

Using tools from the tidyverse:

library(dplyr)
library(stringr)
library(tidyr)

# your data
set.seed(1)

emp_id <- round(rnorm(100, 500000, 10000))
job_code <- sample(c('a', 'b', 'c'), 100, replace = TRUE)
result <- sample(c(1,2,3,4), 100, replace = TRUE)

# construct a data frame
df <- 
  data.frame(emp_id = emp_id,
             job_code = job_code, 
             result = result,
             stringsAsFactors = FALSE)

# your jobs
job_a <- c(0, 500, 1000, 5000)
job_b <- c(0, 200, 500, 750)
job_c <- c(0, 250, 750, 1000)

# construct a data frame
my_job <- 
  data.frame(job_a, job_b, job_c) %>% 
  gather(job, value) %>% 
  group_by(job) %>% 
  mutate(result = 1:n(),
         job_code = str_replace(job, "job_", "")) %>% 
  ungroup %>% 
  select(-job)

# join df and my_job into my_results table
my_results <-
  left_join(df, my_job)

Result

my_results %>% tbl_df

Source: local data frame [100 x 4]

   emp_id job_code result value
    (dbl)    (chr)  (dbl) (dbl)
1  493735        a      3  1000
2  501836        a      1     0
3  491644        b      2   200
4  515953        a      2   500
5  503295        a      2   500
6  491795        b      4   750
7  504874        b      1     0
8  507383        a      4  5000
9  505758        a      1     0
10 496946        c      2   250
..    ...      ...    ...   ...
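
In current tidyr releases gather() is superseded by pivot_longer(). The sketch below shows how the same lookup table might be built with it, assuming reasonably recent dplyr and tidyr versions; toy is a three-row stand-in for df, and the join keys are spelled out rather than guessed by left_join.

```r
library(dplyr)
library(tidyr)

job_a <- c(0, 500, 1000, 5000)
job_b <- c(0, 200, 500, 750)
job_c <- c(0, 250, 750, 1000)

# Wide table -> long lookup table with columns result, job_code, value
my_job <- data.frame(job_a, job_b, job_c) %>%
  mutate(result = row_number()) %>%           # position within each payout vector
  pivot_longer(-result,
               names_to     = "job_code",
               names_prefix = "job_",         # strip the "job_" prefix from the names
               values_to    = "value")

# First three employees from the question's data
toy <- data.frame(emp_id   = c(493735, 501836, 491644),
                  job_code = c("a", "a", "b"),
                  result   = c(3, 1, 2),
                  stringsAsFactors = FALSE)

my_results <- left_join(toy, my_job, by = c("job_code", "result"))
my_results
```

Naming the keys in by avoids the message left_join prints when it has to infer them and guards against joining on an unintended shared column.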


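Notice: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost, please credit this site's URL or the original address. For any questions, contact: yoyou2525@163.com.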

 