
Speeding up recoding of a character column in R

I have some data where each data point is associated with a character vector of varying length. For example, it might be generated by the following function:

library(tidyverse)

set.seed(27)

generate_keyset <- function(...) {
  sample(LETTERS[1:5], size = rpois(n = 1, lambda = 10), replace = TRUE)
}

generate_keyset()
#>  [1] "A" "C" "A" "A" "A" "A" "A" "E" "C" "C" "A" "D" "A" "D" "C" "A"

I would like to summarize this keyset by converting it to a single numeric score. The way this works is straightforward: each key in the keyset has a value, and the value of the entire keyset is the sum of the values of its keys. The key-value map is a tibble with several hundred entries, but you can imagine it looks like:

key_value_map <- tribble(
  ~key, ~value,
  "A",       1,
  "B",      -2,
  "C",       8,
  "D",      -4,
  "E",       0
)

Currently I am scoring keysets with the following function:

score_keyset <- function(keyset) {
  merged_keysets_to_map <- data.frame(
    key = keyset,
    stringsAsFactors = FALSE
  ) %>%
    left_join(key_value_map, by = "key")

  sum(merged_keysets_to_map$value)
}

score_keyset(LETTERS[1:4])
#> [1] 3

This works fine, except it is very slow, and I need to do this operation about a million times. For example, I would like the following to be much faster:

n <- 1e4  # in practice I have n = 1e6

fake_data <- tibble(
  keyset = map(1:n, generate_keyset)
)

library(tictoc)

tic()

scored_data <- fake_data %>%
  mutate(
    value = map_dbl(keyset, score_keyset)
  )

toc()

I am sure there is a much better way to do this with indexing, but it is escaping me at the moment. Any help speeding this up is much appreciated.

Instead of doing a join and then a sum, it is more efficient to convert the map to a named vector and look the values up by key:

library(tibble)
sum(deframe(key_value_map)[generate_keyset()])
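To make the lookup step concrete, here is a minimal base-R sketch of the same idea. The names `lookup` and `keyset` are mine, and `lookup` is typed out by hand to mirror what `deframe(key_value_map)` would return for the five-row example map in the question.

```r
# Named numeric vector: names are the keys, elements are the values.
# Assumed to match deframe(key_value_map) for the example map.
lookup <- c(A = 1, B = -2, C = 8, D = -4, E = 0)

keyset <- c("A", "B", "C", "D")

# Indexing by a character vector returns one value per key, in keyset order
per_key <- lookup[keyset]

# Summing the looked-up values gives the keyset score
score <- sum(per_key)
score
#> [1] 3
```

As with the `left_join` version, a key that is missing from the map comes back as `NA`, so the sum is `NA` unless you pass `na.rm = TRUE`.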

Checking the timings, the OP's code took 45.728 sec according to tic/toc. The named-vector version:

tic()
v1 <- deframe(key_value_map)

scored_data2 <- fake_data %>%
  mutate(
    value = map_dbl(keyset, ~ sum(v1[.x]))
  )

toc()
#0.952 sec elapsed

identical(scored_data, scored_data2)
#[1] TRUE
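If the per-row `map_dbl` call is still too slow at n = 1e6, one option that might help is to flatten all keysets, do a single lookup, and sum by group. This is only a sketch and I have not benchmarked it against the data above. The names `lookup`, `keysets`, `group`, and `scores` are hypothetical, and the toy list stands in for `fake_data$keyset`.

```r
# Assumed lookup vector, equivalent to deframe(key_value_map) for the example map
lookup <- c(A = 1, B = -2, C = 8, D = -4, E = 0)

# Toy stand-in for fake_data$keyset, including one empty keyset
keysets <- list(c("A", "B", "C", "D"), c("E", "E"), character(0), c("C", "C", "A"))

# Row index of each key once the list is flattened
group <- rep(seq_along(keysets), times = lengths(keysets))

# One vectorized lookup over every key in every keyset
vals <- unname(lookup[unlist(keysets)])

# Sum values within each row; rowsum() returns a one-column matrix
sums <- rowsum(vals, group)

# Empty keysets never appear in `group`, so start from zero and fill in the rest
scores <- numeric(length(keysets))
scores[as.integer(rownames(sums))] <- sums[, 1]
scores
#> [1]  3  0  0 17
```

Starting from a zero vector keeps empty keysets at 0, which is what `sum()` returns for an empty keyset in the `map_dbl` version.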
