Efficient Match/Lookup in R

I'm starting with two objects: a data frame of order attributes (order numbers, weights, and volumes) and a list of order combinations, each a vector of order numbers.

attr <- data.frame(Order.No = c(111,222,333), Weight = c(20,75,50), Volume = c(10,30,25))
combn <- list(111, 222, 333, c(111,222), c(111,333), c(222,333), c(111,222,333))

The objective is to find the total weight and total volume ("cube") of each combination of orders, and to keep only the combinations that fall within both the weight and the volume constraints.
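As a small worked illustration of that objective, the sketch below totals a single combination, c(111, 333), by hand. The names `orders` and `combo` are illustrative only, and the limits (20 to 50 for weight, 10 to 50 for volume) are the ones used in the code later in this question.

```r
# Same sample data as above, under an illustrative name
orders <- data.frame(Order.No = c(111, 222, 333),
                     Weight   = c(20, 75, 50),
                     Volume   = c(10, 30, 25))

combo <- c(111, 333)                      # one combination of orders
rows  <- match(combo, orders$Order.No)    # row positions of those orders

total_weight <- sum(orders$Weight[rows])  # 20 + 50
total_volume <- sum(orders$Volume[rows])  # 10 + 25

# A combination is kept only if both totals are inside their ranges
within_limits <- total_weight >= 20 && total_weight <= 50 &&
                 total_volume >= 10 && total_volume <= 50
```

Here the volume is in range but the weight is not, so this combination would be dropped.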

I'm currently using the following:

# Lookup weights for each Order.No in the attr table
# Add up total weight for the combination and keep it if it's in the range
wgts <- lapply(combn, function(x) {
    temp <- attr$Weight[match(x, attr$Order.No)]
    temp <- sum(temp)
    temp[temp <= 50 & temp >= 20]
})
> wgts
[[1]]
[1] 20

[[2]]
numeric(0)

[[3]]
[1] 50

[[4]]
numeric(0)

[[5]]
numeric(0)

[[6]]
numeric(0)

[[7]]
numeric(0)

# Lookup volumes for each Order.No in the attr table
# Add up total volume for the combination and keep it if it's in the range
vols <- lapply(combn, function(x) {
    temp <- attr$Volume[match(x, attr$Order.No)]
    temp <- sum(temp)
    temp[temp <= 50 & temp >= 10]
})
> vols
[[1]]
[1] 10

[[2]]
[1] 30

[[3]]
[1] 25

[[4]]
[1] 40

[[5]]
[1] 35

[[6]]
numeric(0)

[[7]]
numeric(0)

Then I use mapply to merge the two lists of weights and volumes.

# Find and keep only the rows that have both the weights and volumes within their ranges  
which(lapply(mapply(c, wgts, vols), function(x) length(x)) == 2)

# Yields positions 1 and 3, which meet the subsetting conditions
> value value 
    1     3

The code above looks up the weight and volume of each order, sums them per combination, checks each total against its range limit, merges the two lists, and keeps only the combinations whose weight and volume are both within the acceptable ranges.

My current solution completes the task, but it is terribly slow at production volume and does not scale to millions of records. With 11 million order combinations to look up, the process takes about 40 minutes to run, which is unacceptable.

I'm seeking a more efficient method that drastically reduces the run time required to produce the same output.

# changing names, assigning indices to order list
atdf  = data.frame(Order.No = c(111,222,333), Weight = c(20,75,50), Volume = c(10,30,25))
olist = list(111, 222, 333, c(111,222), c(111,333), c(222,333), c(111,222,333))
olist <- setNames(olist,seq_along(olist))

# defining filtering predicate:

sel_orders = function(os, mins=c(20,10), maxs=c(50,50)) {
    tot = colSums(atdf[match(os, atdf$Order.No), c("Weight","Volume")])
    all(maxs >= tot & tot >= mins)
}

# Filtering orders

olist[sapply(olist, sel_orders)]
# or 
Filter(x = olist, f = sel_orders)

Both of these give:

# $`1`
# [1] 111
# 
# $`3`
# [1] 333

To change the minimums and maximums, pass different mins and maxs:

olist[ sapply(olist, sel_orders, mins = c(0,0), maxs = c(70,70)) ]

# $`1`
# [1] 111
# 
# $`3`
# [1] 333
# 
# $`5`
# [1] 111 333
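The predicate above still calls match() and subsets the data frame once per combination. One possible way to reduce that overhead, sketched here in base R and not benchmarked on the production data, is to flatten the list, do a single lookup for all orders, and sum by group with rowsum(). The names `orders`, `combos`, `combo_id`, `totals`, and `keep` are illustrative, and the limits are the ones from the question.

```r
# Sample data from the question, under illustrative names
orders <- data.frame(Order.No = c(111, 222, 333),
                     Weight   = c(20, 75, 50),
                     Volume   = c(10, 30, 25))
combos <- list(111, 222, 333, c(111, 222), c(111, 333),
               c(222, 333), c(111, 222, 333))

# Flatten the list once and record which combination each order belongs to
flat_orders <- unlist(combos)
combo_id    <- rep(seq_along(combos), lengths(combos))

# A single match() call covers every order in every combination
idx <- match(flat_orders, orders$Order.No)

# Per-combination sums of weight and volume (one row per combination)
totals <- rowsum(cbind(Weight = orders$Weight[idx],
                       Volume = orders$Volume[idx]),
                 group = combo_id)

# Positions of the combinations that satisfy both range constraints
keep <- which(totals[, "Weight"] >= 20 & totals[, "Weight"] <= 50 &
              totals[, "Volume"] >= 10 & totals[, "Volume"] <= 50)

combos[keep]
```

This assumes every combination contains at least one order and that every order number appears in the lookup table. An unmatched order would produce NA totals, and that combination would be silently dropped.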

I don't know how much faster this will be, but here's a dplyr/tidyr solution.

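library(dplyr)
library(tidyr)

# One row per (combination, order) pair.
# tibble() replaces data_frame(), which is deprecated.
combination =
  tibble(Order.No = combn) %>%
  mutate(combination_ID = 1:n()) %>%
  unnest(Order.No)

# Attach the attributes, total them per combination, and keep
# only the combinations with both totals inside their ranges.
acceptable =
  combination %>%
  left_join(attr, by = "Order.No") %>%
  group_by(combination_ID) %>%
  summarize(total_weight = sum(Weight),
            total_volume = sum(Volume)) %>%
  filter(between(total_weight, 20, 50) &
           between(total_volume, 10, 50))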
