[英]R - Increasing runtime for for loop in R for large datasets
考慮以下示例,其中R-中的這2個數據幀
original = data.frame(group = paste("G",c(1:5),sep=""),
field1 = c("A","B","C","D","E"),
cost = round(runif(5,300,500),2),
slno = c("1 4 5 7","1 3","9","2 5 7 10","1 10"), stringsAsFactors = F)
alternative = data.frame(slno = c(1:10),
name = paste("name",c(1:10),sep=""),
cost = round(runif(10,50,100),2), stringsAsFactors = F)
我想執行以下步驟並以原始方式輸入這些列-
將原始數據幀第4列中存在的每個slno (以空格分隔)映射到替代數據幀中,並獲取費用。
從原始成本中減去替代成本的每個成本,然后計算節省額。
original $ max_alternative列應具有最大節省量的替代名稱。 original $ max_saving應該有相應的節省。
original $ oth_alt列應將所有其他名稱以分號分隔。 original $ oth_savings應該用相應的儲蓄分號分隔。
數據集->
> original
group field1 cost slno
1 G1 A 330.37 1 4 5 7
2 G2 B 463.80 1 3
3 G3 C 471.74 9
4 G4 D 465.71 2 5 7 10
5 G5 E 472.83 1 10
> alternative
slno name cost
1 1 name1 64.98
2 2 name2 94.63
3 3 name3 98.96
4 4 name4 68.39
5 5 name5 61.48
6 6 name6 87.46
7 7 name7 75.91
8 8 name8 67.93
9 9 name9 55.29
10 10 name10 93.03
所需的輸出->
> original
group field1 cost slno max_alternative max_saving oth_alt oth_sav
1 G1 A 330.37 1 4 5 7 name5 268.89 name1;name4;name7 265.39;261.98;254.46
2 G2 B 463.80 1 3 name1 398.82 name3 364.84
3 G3 C 471.74 9 name9 416.45
4 G4 D 465.71 2 5 7 10 name5 404.23 name2;name7;name10 371.08;389.80;372.68
5 G5 E 472.83 1 10 name1 407.85 name10 379.80
注意事項:我舉了一個小例子來說明我的問題。 就我而言,我有巨大的數據幀,每個幀有近100萬行。 因此,在這種情況下,for循環效率不高,因為要花幾個小時才能完成。 有沒有有效的方法可以做到這一點?
提前致謝!
使用dplyr和tidyr的解決方案。 original2
是最終輸出。 關鍵是要使用separate_rows
擴大slno
列,執行基於加入slno
之間original
和alternative
,然后用group_by
和summarize
,總結的所有信息。 請注意, which.min
僅返回向量中的第一個最小值。 如果您有多個等於最小值的值,則代碼仍將僅返回第一個最小值。
library(dplyr)
library(tidyr)
original2 <- original %>%
separate_rows(slno, convert = TRUE) %>%
left_join(alternative, by = "slno") %>%
group_by(group, field1) %>%
summarise(cost = first(cost.x),
slno = paste(slno, collapse = " "),
max_alternative = name[which.min(cost.y)],
max_saving = first(cost.x) - cost.y[which.min(cost.y)],
oth_alt = paste(name[-which.min(cost.y)], collapse = ";"),
oth_sav = paste(first(cost.x) - cost.y[-which.min(cost.y)], collapse = ";")) %>%
ungroup() %>%
as.data.frame(stringsAsFactors = FALSE)
original2
# group field1 cost slno max_alternative max_saving oth_alt oth_sav
# 1 G1 A 330.37 1 4 5 7 name5 268.89 name1;name4;name7 265.39;261.98;254.46
# 2 G2 B 463.80 1 3 name1 398.82 name3 364.84
# 3 G3 C 471.74 9 name9 416.45
# 4 G4 D 465.71 2 5 7 10 name5 404.23 name2;name7;name10 371.08;389.8;372.68
# 5 G5 E 472.83 1 10 name1 407.85 name10 379.8
這是使用data.table和splitstackshape的替代方法。 的cSplit
功能,如separate_rows
,也可以擴大該數據幀。
library(data.table)
library(splitstackshape)
setDT(alternative)
original2 <- cSplit(original, "slno", direction = "long", sep = " ")
original3 <- merge(original2, alternative, by = "slno", all.x = TRUE)
original4 <- original3[, .(cost = first(cost.x),
slno = paste(slno, collapse = " "),
max_alternative = name[which.min(cost.y)],
max_saving = first(cost.x) - cost.y[which.min(cost.y)],
oth_alt = paste(name[-which.min(cost.y)], collapse = ";"),
oth_sav = paste(first(cost.x) - cost.y[-which.min(cost.y)], collapse = ";")),
by = .(group, field1)][order(group)]
original4[]
# group field1 cost slno max_alternative max_saving oth_alt oth_sav
# 1: G1 A 330.37 1 4 5 7 name5 268.89 name1;name4;name7 265.39;261.98;254.46
# 2: G2 B 463.80 1 3 name1 398.82 name3 364.84
# 3: G3 C 471.74 9 name9 416.45
# 4: G4 D 465.71 2 5 7 10 name5 404.23 name2;name7;name10 371.08;389.8;372.68
# 5: G5 E 472.83 1 10 name1 407.85 name10 379.8
績效評估
正如OP提到的那樣,性能可能是一個問題。 在這里,我使用了microbenchmark程序包和以下代碼來查看哪個速度更快。 m1
是dplyr
方法,而m2
是data.table
方法。
library(microbenchmark)
# Create data.table object
alternative_dt <- as.data.table(alternative)
original_dt <- as.data.table(original)
# Evaluate performance
microbenchmark(m1 = {
original2 <- original %>%
separate_rows(slno, convert = TRUE) %>%
left_join(alternative, by = "slno") %>%
group_by(group, field1) %>%
summarise(cost = first(cost.x),
slno = paste(slno, collapse = " "),
max_alternative = name[which.min(cost.y)],
max_saving = first(cost.x) - cost.y[which.min(cost.y)],
oth_alt = paste(name[-which.min(cost.y)], collapse = ";"),
oth_sav = paste(first(cost.x) - cost.y[-which.min(cost.y)], collapse = ";")) %>%
ungroup()},
m2 = {original2 <- cSplit(original_dt, "slno", direction = "long", sep = " ")
original3 <- merge(original2, alternative, by = "slno", all.x = TRUE)
original4 <- original3[, .(cost = first(cost.x),
slno = paste(slno, collapse = " "),
max_alternative = name[which.min(cost.y)],
max_saving = first(cost.x) - cost.y[which.min(cost.y)],
oth_alt = paste(name[-which.min(cost.y)], collapse = ";"),
oth_sav = paste(first(cost.x) - cost.y[-which.min(cost.y)], collapse = ";")),
by = .(group, field1)][order(group)]})
# Unit: milliseconds
# expr min lq mean median uq max neval
# m1 21.106662 22.673250 23.978065 23.519644 25.005269 33.26359 100
# m2 3.886784 4.418318 4.730305 4.702078 4.970674 7.61164 100
結果表明data.table
比dplyr
快。 如果OP正在處理大量數據。 data.table
可能是第一選擇。 但是,盡管我沒有開發for循環方法並測試性能,但data.table
和dplyr
方法都可能比for循環快很多。
數據
original <- read.table(text = "group field1 cost slno
1 G1 A 330.37 '1 4 5 7'
2 G2 B 463.80 '1 3'
3 G3 C 471.74 '9'
4 G4 D 465.71 '2 5 7 10'
5 G5 E 472.83 '1 10'",
header = TRUE, stringsAsFactors = FALSE)
alternative <- read.table(text = " slno name cost
1 1 name1 64.98
2 2 name2 94.63
3 3 name3 98.96
4 4 name4 68.39
5 5 name5 61.48
6 6 name6 87.46
7 7 name7 75.91
8 8 name8 67.93
9 9 name9 55.29
10 10 name10 93.03",
header = TRUE, stringsAsFactors = FALSE)
一種選擇是使用sqldf
和dplyr
。 為了保持清晰度,分步驟顯示了解決方案。
#The data
library(sqldf)
library(dplyr)
original = data.frame(group = paste("G",c(1:5),sep=""),
field1 = c("A","B","C","D","E"),
cost = round(runif(5,300,500),2),
slno = c("1 4 5 7","1 3","9","2 5 7 10","1 10"), stringsAsFactors = F)
alternative = data.frame(slno = c(1:10),
name = paste("name",c(1:10),sep=""),
cost = round(runif(10,50,100),2), stringsAsFactors = F)
#> original
# group field1 cost slno
#1 G1 A 490.71 1 4 5 7
#2 G2 B 399.20 1 3
#3 G3 C 326.40 9
#4 G4 D 421.69 2 5 7 10
#5 G5 E 498.37 1 10
#> alternative
# slno name cost
#1 1 name1 54.74
#2 2 name2 94.76
#3 3 name3 66.74
#4 4 name4 73.61
#5 5 name5 58.86
#6 6 name6 67.58
#7 7 name7 58.83
#8 8 name8 82.65
#9 9 name9 61.81
#10 10 name10 94.86
#join both data.frames
join_qury <- "select original.*, alternative.name as alternative, (original.cost - alternative.cost) as saving from original
inner join alternative where original.slno like '%' || alternative.slno || '%'"
df <- sqldf(join_qury,stringsAsFactors = FALSE)
#> df
# group field1 cost slno alternative saving
#1 G1 A 490.71 1 4 5 7 name1 435.97
#2 G1 A 490.71 1 4 5 7 name4 417.10
#3 G1 A 490.71 1 4 5 7 name5 431.85
#4 G1 A 490.71 1 4 5 7 name7 431.88
#5 G2 B 399.20 1 3 name1 344.46
#6 G2 B 399.20 1 3 name3 332.46
#7 G3 C 326.40 9 name9 264.59
#8 G4 D 421.69 2 5 7 10 name1 366.95
#9 G4 D 421.69 2 5 7 10 name2 326.93
#10 G4 D 421.69 2 5 7 10 name5 362.83
#11 G4 D 421.69 2 5 7 10 name7 362.86
#12 G4 D 421.69 2 5 7 10 name10 326.83
#13 G5 E 498.37 1 10 name1 443.63
#14 G5 E 498.37 1 10 name10 403.51
# Filter data to contain only max value for a name
df_maxval <- df %>%
group_by(group,field1, cost, slno) %>%
filter(saving == max(saving))
#Find and group other name and savings
df_other <- setdiff(df, df_maxval) %>%
group_by(group,field1, cost, slno) %>%
summarise_at(.vars = vars(alternative, saving),
.funs = c("toString")) %>%
ungroup()
# finally join both max savings and other values
df_final <- df_maxval %>%
inner_join(df_other, by = c("group", "field1", "slno")) %>%
select(group, field1, cost = cost.x, slno, max_alternative = alternative.x,
max_saving = saving.x, oth_alt = alternative.y, oth_sav = saving.y)
#Result
#> df_final
# A tibble: 4 x 8
# Groups: group, field1, cost, slno [4]
# group field1 cost slno max_alternative max_saving oth_alt oth_sav
# <chr> <chr> <dbl> <chr> <chr> <dbl> <chr> <chr>
#1 G1 A 490.71 1 4 5 7 name1 435.97 name4, name5, name7 417.1, 431.85, 431.88
#2 G2 B 399.20 1 3 name1 344.46 name3 332.46
#3 G4 D 421.69 2 5 7 10 name1 366.95 name2, name5, name7, name10 326.93, 362.83, 362.86, 326.83
#4 G5 E 498.37 1 10 name1 443.63 name10 403.51
這是使用基數R的試用版:
transform(df1,e=t(mapply(function(x,y){
v=df2[x,][s<-which.min(df2$cost[x]),-1];
w=df2[x,][-s,-1]
cbind( c(v[1],y-v[2]),
apply(cbind(w[,1],y-w[,2]),2,paste0,collapse=";"))},
lapply(strsplit(df1$slno," "),as.numeric),df1$cost)))
group field1 cost slno e.1 e.2 e.3 e.4
1 G1 A 330.37 1 4 5 7 name5 268.89 name1;name4;name7 265.39;261.98;254.46
2 G2 B 463.80 1 3 name1 398.82 name3 364.84
3 G3 C 471.74 9 name9 416.45
4 G4 D 465.71 2 5 7 10 name5 404.23 name2;name7;name10 371.08;389.8;372.68
5 G5 E 472.83 1 10 name1 407.85 name10 379.8
>
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.