简体   繁体   English

在 R 中分配类别是否有比使用循环更有效的方法?

[英]Is there a more efficient way to assign categories in R than using a loop?

I am currently working with a large dataset (two subsamples of sizes 2,008,384x15 and 3,072,435x15 respectively).我目前正在处理一个大型数据集(两个子样本的大小分别为 2,008,384x15 和 3,072,435x15)。 The data are monthly food and beverage price quotes over a period of ca 200 months.这些数据是大约 200 个月内的月度食品和饮料报价。 I would greatly appreciate your help with the following question:非常感谢您对以下问题的帮助:

In the original data, each observation (price quote of a product) has a category index from 1 to 151. I want to assign broader categories to these, ie grouping the 151 categories under 19 super-categories.在原始数据中,每个观察值(产品报价)都有一个从 1 到 151 的类别索引。我想为这些分配更广泛的类别,即将 151 个类别分组到 19 个超类别下。 Unfortunately, the breaks are not straight-forward, ie one category, say cheese, includes the indices 57,58,59,60 and 143.不幸的是,休息不是直接的,即一个类别,比如奶酪,包括索引 57、58、59、60143。

I tried using a for-loop for the assignment -- which works -- but it took my computer like 2-3 days to run it (inserted below) Therefore, I'd very much like to ask for a more efficient way to assign these categories.我尝试使用 for 循环进行分配——这很有效——但我的计算机需要 2-3 天才能运行它(在下面插入)因此,我非常想寻求一种更有效的分配方式这些类别。

Thank you very much in advance for your help!非常感谢您的帮助! All the best!一切顺利!

sample_1993_2002$category_all <- NA
sample_1993_2002 = sample_1993_2002[order(sample_1993_2002$category_idx),]

for (i in 1:nrow(sample_1993_2002)){
  
  if (sample_1993_2002$category_idx[i] %in% c(1, 2, 3, 4, 5, 6, 17, 19, 68,69, 70,71,72,74,75,76,77,78,80,81,82,83,84,85,86, 88, 89,90,91, 92, 93, 94, 95, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 127, 141, 151,148, 146,147)) {sample_1993_2002$category_all[i] <- "fruit_veg_cereal"
  } else if (sample_1993_2002$category_idx[i] %in% c(9,10,11,12,13,14,123)) {sample_1993_2002$category_all[i] <-"bakery"
  } else if (sample_1993_2002$category_idx[i] == 16) {sample_1993_2002$category_all[i] <- "pasta" 
  } else if (sample_1993_2002$category_idx[i] %in% c(20,21,22,23,24,25,26,27,28,29,30,31,129)) {sample_1993_2002$category_all[i] <- "meat_fresh"
  } else if (sample_1993_2002$category_idx[i] %in% c(32,33,34,35,36,37,38,39, 121, 130)) {sample_1993_2002$category_all[i] <- "meat_product"
  } else if (sample_1993_2002$category_idx[i] %in% c(40,41,42,43,44,45,46,47,48)) {sample_1993_2002$category_all[i] <- "fish"
  } else if (sample_1993_2002$category_idx[i] %in% c(49,50,51,52,53,54,56,61,62,63,111,125)) {sample_1993_2002$category_all[i] <- "animal_prod"
  } else if (sample_1993_2002$category_idx[i] %in% c(55,64,65,66,67)) {sample_1993_2002$category_all[i] <-  "oils_butter"
  } else if (sample_1993_2002$category_idx[i] %in% c(57,58,59,60, 143)){sample_1993_2002$category_all[i] <-"cheese" 
  } else if (sample_1993_2002$category_idx[i] %in% c(112,113)) {sample_1993_2002$category_all[i] <- "coffee" 
  } else if (sample_1993_2002$category_idx[i] %in% c(114,115)) {sample_1993_2002$category_all[i] <- "chocolate" 
  } else if (sample_1993_2002$category_idx[i] %in% c(116,117, 119,150)) {sample_1993_2002$category_all[i] <- "spice"
  } else if (sample_1993_2002$category_idx[i] ==124) {sample_1993_2002$category_all[i] <- "jams"
  } else if (sample_1993_2002$category_idx[i] == 134) {sample_1993_2002$category_all[i] <- "beer"
  } else if (sample_1993_2002$category_idx[i] == 135) {sample_1993_2002$category_all[i] <-"rum"
  } else if (sample_1993_2002$category_idx[i]== 136) {sample_1993_2002$category_all[i] <- "brandy"
  } else if (sample_1993_2002$category_idx[i]== 137) {sample_1993_2002$category_all[i] <- "wine"
  } else if (sample_1993_2002$category_idx[i] == 138) {sample_1993_2002$category_all[i] <- "spirit"
  } else {sample_1993_2002$category_all[i] <- "other"}
}

Yes.是的。 Don't use a loop.不要使用循环。 R is vectorised by default, So, as a proof of concept: using random data: R 默认情况下是矢量化的,因此,作为概念证明:使用随机数据:

library(tidyverse)

df <- tibble(Category=sample(1:151, size=2e6, replace=TRUE))

df <- df %>% 
       mutate(
         SuperCategory=ifelse(Category %in% c(9,10,11,12,13,14,123), "bakery", NA),
         SuperCategory=ifelse(Category %in% c(16), "pasta", SuperCategory),
         SuperCategory=ifelse(Category %in% c(20,21,22,23,24,25,26,27,28,29,30,31,129), "meat_fresh", SuperCategory),
         SuperCategory=ifelse(Category %in% c(32,33,34,35,36,37,38,39, 121, 130), "meat_product", SuperCategory),
         SuperCategory=ifelse(is.na(SuperCategory), "other", SuperCategory)
      )
df %>% 
  group_by(SuperCategory) %>% 
  summarise(N=n(), .groups="drop")
# A tibble: 5 x 2
  SuperCategory       N
  <chr>           <int>
1 bakery          92606
2 meat_fresh     172751
3 meat_product   132752
4 other         1588537
5 pasta           13354

I haven't implemented all your super category definitions, but you get the picture.我还没有实现你所有的超级类别定义,但你明白了。 This runs very quickly:这运行得非常快:

library(microbenchmark)

microbenchmark({
df <- df %>% 
       mutate(
         SuperCategory=ifelse(Category %in% c(9,10,11,12,13,14,123), "bakery", NA),
         SuperCategory=ifelse(Category %in% c(16), "pasta", SuperCategory),
         SuperCategory=ifelse(Category %in% c(20,21,22,23,24,25,26,27,28,29,30,31,129), "meat_fresh", SuperCategory),
         SuperCategory=ifelse(Category %in% c(32,33,34,35,36,37,38,39, 121, 130), "meat_product", SuperCategory),
         SuperCategory=ifelse(is.na(SuperCategory), "other", SuperCategory)
      )
 },
 times=10
)

<output deleted>

Unit: seconds
                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
      min       lq     mean   median       uq      max neval
 2.640724 2.655678 2.764513 2.680773 2.901454 3.002492    10

So a mean of 2.7 seconds for 2 million rows.因此,200 万行的平均值为 2.7 秒。 I'm not prepared to run your code to get a direct comparison.我不准备运行您的代码来进行直接比较。 :=) :=)

You can also try data.table , very fast for big tables.您也可以尝试data.table ,对于大桌子来说非常快。 Here the same Limey's code and data in data.table way:这里同样是Limey的代码和data.table方式的数据:

library(data.table)
set.seed(1234)
dt <- data.frame(Category=sample(1:151, size=2e6, replace=TRUE)

result <-
 data.table(dt)[,':='(SuperCategory = fifelse(Category %in% c(9:14, 125),'bakery',
                                       fifelse(Category %in% c(16),'pasta',
                                         fifelse(Category %in% c(20:31, 129),'meat_fresh',
                                           fifelse(Category %in% c(32:39, 130),'meat_product',
                                             fifelse(is.na(Category),'other','x'))))))]

And here the performances for 100 iterations:这里是 100 次迭代的性能:

library(microbenchmark)
library(dplyr)

microbenchmark(
data.table =  data.table(dt)[,':='(SuperCategory = fifelse(Category %in% c(9:14, 125),'bakery',
                                                  fifelse(Category %in% c(16),'pasta',
                                                  fifelse(Category %in% c(20:31, 129),'meat_fresh',
                                                  fifelse(Category %in% c(32:39, 130),'meat_product',
                                                  fifelse(is.na(Category),'other','x'))))))]

,

tidyverse = dt %>% 
 tibble() %>%
 mutate(
  SuperCategory=ifelse(Category %in% c(9,10,11,12,13,14,123), "bakery", NA),
  SuperCategory=ifelse(Category %in% c(16), "pasta", SuperCategory),
  SuperCategory=ifelse(Category %in% c(20,21,22,23,24,25,26,27,28,29,30,31,129), "meat_fresh", SuperCategory),
  SuperCategory=ifelse(Category %in% c(32,33,34,35,36,37,38,39, 121, 130), "meat_product", SuperCategory),
  SuperCategory=ifelse(is.na(SuperCategory), "other", SuperCategory)
 ))

Unit: milliseconds
       expr       min        lq      mean    median       uq       max neval
 data.table  373.2851  393.7549  437.7108  422.4139  440.835  743.3399   100
  tidyverse 3080.3477 3170.4254 3333.6604 3254.2223 3437.303 4192.6353   100

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

相关问题 比R中的for循环更有效 - More efficient than for loop in R R:限制排列比使用for循环更有效 - R: Restricted permutations more efficient way than using for loops R:比这个for-loop更有效的解决方案 - R: more efficient solution than this for-loop 寻找一种比在R中循环更有效的方法来创建数据帧 - looking for a more efficient way to create a dataframe than looping in R 是否有一种更有效的方法来填充额外的列而不是&#39;for&#39;循环? - Is there a more efficient way to fill extra column than a 'for' loop? 比strsplit()在R中将字符串拆分为两个更有效的内存方式 - More memory efficient way than strsplit() to split a string into two in R 如何在不使用循环的情况下或更有效地计算R中的“连续平均值”? - how to calculate “consecutive mean” in R without using loop, or in a more efficient way? 有没有一种方法可以向量化R中的这个foreach循环,以使文本替换更有效? - Is there a way to vectorize this foreach loop in R to make text replacement more efficient? R 用更有效的方式替换嵌套循环来遍历 2 个数据帧 - R replace nested loop with a more efficient way of going over 2 dataframes 此R循环的更有效版本 - More efficient version of this R loop
 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM