[英]Is there a more efficient way to assign categories in R than using a loop?
I am currently working with a large dataset (two subsamples of sizes 2,008,384x15 and 3,072,435x15 respectively).我目前正在处理一个大型数据集(两个子样本的大小分别为 2,008,384x15 和 3,072,435x15)。 The data are monthly food and beverage price quotes over a period of ca 200 months.
这些数据是大约 200 个月内的月度食品和饮料报价。 I would greatly appreciate your help with the following question:
非常感谢您对以下问题的帮助:
In the original data, each observation (price quote of a product) has a category index from 1 to 151. I want to assign broader categories to these, ie grouping the 151 categories under 19 super-categories.在原始数据中,每个观察值(产品报价)都有一个从 1 到 151 的类别索引。我想为这些分配更广泛的类别,即将 151 个类别分组到 19 个超类别下。 Unfortunately, the breaks are not straight-forward, ie one category, say cheese, includes the indices 57,58,59,60 and 143.
不幸的是,休息不是直接的,即一个类别,比如奶酪,包括索引 57、58、59、60和143。
I tried using a for-loop for the assignment -- which works -- but it took my computer like 2-3 days to run it (inserted below) Therefore, I'd very much like to ask for a more efficient way to assign these categories.我尝试使用 for 循环进行分配——这很有效——但我的计算机需要 2-3 天才能运行它(在下面插入)因此,我非常想寻求一种更有效的分配方式这些类别。
Thank you very much in advance for your help!非常感谢您的帮助! All the best!
一切顺利!
sample_1993_2002$category_all <- NA
sample_1993_2002 = sample_1993_2002[order(sample_1993_2002$category_idx),]
for (i in 1:nrow(sample_1993_2002)){
if (sample_1993_2002$category_idx[i] %in% c(1, 2, 3, 4, 5, 6, 17, 19, 68,69, 70,71,72,74,75,76,77,78,80,81,82,83,84,85,86, 88, 89,90,91, 92, 93, 94, 95, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 127, 141, 151,148, 146,147)) {sample_1993_2002$category_all[i] <- "fruit_veg_cereal"
} else if (sample_1993_2002$category_idx[i] %in% c(9,10,11,12,13,14,123)) {sample_1993_2002$category_all[i] <-"bakery"
} else if (sample_1993_2002$category_idx[i] == 16) {sample_1993_2002$category_all[i] <- "pasta"
} else if (sample_1993_2002$category_idx[i] %in% c(20,21,22,23,24,25,26,27,28,29,30,31,129)) {sample_1993_2002$category_all[i] <- "meat_fresh"
} else if (sample_1993_2002$category_idx[i] %in% c(32,33,34,35,36,37,38,39, 121, 130)) {sample_1993_2002$category_all[i] <- "meat_product"
} else if (sample_1993_2002$category_idx[i] %in% c(40,41,42,43,44,45,46,47,48)) {sample_1993_2002$category_all[i] <- "fish"
} else if (sample_1993_2002$category_idx[i] %in% c(49,50,51,52,53,54,56,61,62,63,111,125)) {sample_1993_2002$category_all[i] <- "animal_prod"
} else if (sample_1993_2002$category_idx[i] %in% c(55,64,65,66,67)) {sample_1993_2002$category_all[i] <- "oils_butter"
} else if (sample_1993_2002$category_idx[i] %in% c(57,58,59,60, 143)){sample_1993_2002$category_all[i] <-"cheese"
} else if (sample_1993_2002$category_idx[i] %in% c(112,113)) {sample_1993_2002$category_all[i] <- "coffee"
} else if (sample_1993_2002$category_idx[i] %in% c(114,115)) {sample_1993_2002$category_all[i] <- "chocolate"
} else if (sample_1993_2002$category_idx[i] %in% c(116,117, 119,150)) {sample_1993_2002$category_all[i] <- "spice"
} else if (sample_1993_2002$category_idx[i] ==124) {sample_1993_2002$category_all[i] <- "jams"
} else if (sample_1993_2002$category_idx[i] == 134) {sample_1993_2002$category_all[i] <- "beer"
} else if (sample_1993_2002$category_idx[i] == 135) {sample_1993_2002$category_all[i] <-"rum"
} else if (sample_1993_2002$category_idx[i]== 136) {sample_1993_2002$category_all[i] <- "brandy"
} else if (sample_1993_2002$category_idx[i]== 137) {sample_1993_2002$category_all[i] <- "wine"
} else if (sample_1993_2002$category_idx[i] == 138) {sample_1993_2002$category_all[i] <- "spirit"
} else {sample_1993_2002$category_all[i] <- "other"}
}
Yes.是的。 Don't use a loop.
不要使用循环。 R is vectorised by default, So, as a proof of concept: using random data:
R 默认情况下是矢量化的,因此,作为概念证明:使用随机数据:
library(tidyverse)
df <- tibble(Category=sample(1:151, size=2e6, replace=TRUE))
df <- df %>%
mutate(
SuperCategory=ifelse(Category %in% c(9,10,11,12,13,14,123), "bakery", NA),
SuperCategory=ifelse(Category %in% c(16), "pasta", SuperCategory),
SuperCategory=ifelse(Category %in% c(20,21,22,23,24,25,26,27,28,29,30,31,129), "meat_fresh", SuperCategory),
SuperCategory=ifelse(Category %in% c(32,33,34,35,36,37,38,39, 121, 130), "meat_product", SuperCategory),
SuperCategory=ifelse(is.na(SuperCategory), "other", SuperCategory)
)
df %>%
group_by(SuperCategory) %>%
summarise(N=n(), .groups="drop")
# A tibble: 5 x 2
SuperCategory N
<chr> <int>
1 bakery 92606
2 meat_fresh 172751
3 meat_product 132752
4 other 1588537
5 pasta 13354
I haven't implemented all your super category definitions, but you get the picture.我还没有实现你所有的超级类别定义,但你明白了。 This runs very quickly:
这运行得非常快:
library(microbenchmark)
microbenchmark({
df <- df %>%
mutate(
SuperCategory=ifelse(Category %in% c(9,10,11,12,13,14,123), "bakery", NA),
SuperCategory=ifelse(Category %in% c(16), "pasta", SuperCategory),
SuperCategory=ifelse(Category %in% c(20,21,22,23,24,25,26,27,28,29,30,31,129), "meat_fresh", SuperCategory),
SuperCategory=ifelse(Category %in% c(32,33,34,35,36,37,38,39, 121, 130), "meat_product", SuperCategory),
SuperCategory=ifelse(is.na(SuperCategory), "other", SuperCategory)
)
},
times=10
)
<output deleted>
Unit: seconds
min lq mean median uq max neval
2.640724 2.655678 2.764513 2.680773 2.901454 3.002492 10
So a mean of 2.7 seconds for 2 million rows.因此,200 万行的平均值为 2.7 秒。 I'm not prepared to run your code to get a direct comparison.
我不准备运行您的代码来进行直接比较。 :=)
:=)
You can also try data.table
, very fast for big tables.您也可以尝试
data.table
,对于大桌子来说非常快。 Here the same Limey's code and data in data.table
way:这里同样是Limey的代码和
data.table
方式的数据:
library(data.table)
set.seed(1234)
dt <- data.frame(Category=sample(1:151, size=2e6, replace=TRUE)
result <-
data.table(dt)[,':='(SuperCategory = fifelse(Category %in% c(9:14, 125),'bakery',
fifelse(Category %in% c(16),'pasta',
fifelse(Category %in% c(20:31, 129),'meat_fresh',
fifelse(Category %in% c(32:39, 130),'meat_product',
fifelse(is.na(Category),'other','x'))))))]
And here the performances for 100 iterations:这里是 100 次迭代的性能:
library(microbenchmark)
library(dplyr)
microbenchmark(
data.table = data.table(dt)[,':='(SuperCategory = fifelse(Category %in% c(9:14, 125),'bakery',
fifelse(Category %in% c(16),'pasta',
fifelse(Category %in% c(20:31, 129),'meat_fresh',
fifelse(Category %in% c(32:39, 130),'meat_product',
fifelse(is.na(Category),'other','x'))))))]
,
tidyverse = dt %>%
tibble() %>%
mutate(
SuperCategory=ifelse(Category %in% c(9,10,11,12,13,14,123), "bakery", NA),
SuperCategory=ifelse(Category %in% c(16), "pasta", SuperCategory),
SuperCategory=ifelse(Category %in% c(20,21,22,23,24,25,26,27,28,29,30,31,129), "meat_fresh", SuperCategory),
SuperCategory=ifelse(Category %in% c(32,33,34,35,36,37,38,39, 121, 130), "meat_product", SuperCategory),
SuperCategory=ifelse(is.na(SuperCategory), "other", SuperCategory)
))
Unit: milliseconds
expr min lq mean median uq max neval
data.table 373.2851 393.7549 437.7108 422.4139 440.835 743.3399 100
tidyverse 3080.3477 3170.4254 3333.6604 3254.2223 3437.303 4192.6353 100
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.