From dataframe with values per min max to value per key

I have a dataframe with values defined per bucket (see df1 below). I also have another dataframe with keys that fall within those buckets, and for each key I want to look up the value from the bucketed dataframe (see df2 below).

I would like to get the result df3 below.

df1 <- data.frame(MIN = c(1,4,8), MAX = c(3, 6, 10), VALUE = c(3, 56, 8))
df2 <- data.frame(KEY = c(2,5,9))
df3 <- data.frame(KEY = c(2,5,9), VALUE = c(3, 56, 8))

> df1
  MIN MAX VALUE
1   1   3     3
2   4   6    56
3   8  10     8
> df2
  KEY
1   2
2   5
3   9
> df3
  KEY VALUE
1   2     3
2   5    56
3   9     8

EDIT: Extended the example.

> df1 <- data.frame(MIN = c(1,4,8, 14), MAX = c(3, 6, 10, 18), VALUE = c(3, 56, 3, 5))
> df2 <- data.frame(KEY = c(2,5,9,18,3))
> df3 <- data.frame(KEY = c(2,5,9,18,3), VALUE = c(3, 56, 3, 5, 3))
> df1
  MIN MAX VALUE
1   1   3     3
2   4   6    56
3   8  10     3
4  14  18     5
> df2
  KEY
1   2
2   5
3   9
4  18
5   3
> df3
  KEY VALUE
1   2     3
2   5    56
3   9     3
4  18     5
5   3     3

This solution assumes that KEY, MIN and MAX are integers, so we can create a sequence of keys for each bucket and then join.

df1 <- data.frame(MIN = c(1,4,8, 14), MAX = c(3, 6, 10, 18), VALUE = c(3, 56, 3, 5))
df2 <- data.frame(KEY = c(2,5,9,18,3))

library(dplyr)
library(purrr)
library(tidyr)

df1 %>%
  group_by(VALUE, id=row_number()) %>%             # for each value and row id
  nest() %>%                                       # nest rest of columns
  mutate(KEY = map(data, ~seq(.$MIN, .$MAX))) %>%  # create a sequence of keys
  unnest(KEY) %>%                                  # unnest those keys
  right_join(df2, by="KEY") %>%                    # join the other dataset
  select(KEY, VALUE) 

# # A tibble: 5 x 2
#     KEY VALUE
#   <dbl> <dbl>
# 1  2.00  3.00
# 2  5.00 56.0 
# 3  9.00  3.00
# 4 18.0   5.00
# 5  3.00  3.00

Or, group just by the row number and add VALUE in the map:

df1 %>%
  group_by(id=row_number()) %>% 
  nest() %>%                 
  mutate(K = map(data, ~data.frame(VALUE = .$VALUE, 
                                   KEY = seq(.$MIN, .$MAX)))) %>%
  unnest(K) %>%
  right_join(df2, by="KEY") %>% 
  select(KEY, VALUE)
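
To make the expansion step easier to picture, here is a rough base R sketch of the same idea: each bucket is expanded into one row per integer key it covers, and the keys are then looked up in that table. It relies on the same integer assumption as the pipelines above. The name `expanded` is illustrative only, and because `match` returns the first hit, overlapping buckets would yield just one value per key.

```r
# Example data from the question
df1 <- data.frame(MIN = c(1, 4, 8, 14), MAX = c(3, 6, 10, 18), VALUE = c(3, 56, 3, 5))
df2 <- data.frame(KEY = c(2, 5, 9, 18, 3))

# Expand every bucket into one row per integer key it covers
# (`expanded` is an illustrative name, not part of the answer above)
expanded <- do.call(rbind, Map(
  function(lo, hi, v) data.frame(KEY = seq(lo, hi), VALUE = v),
  df1$MIN, df1$MAX, df1$VALUE
))

# Look up each key in the expanded table; keys covered by no bucket get NA
df3 <- data.frame(KEY = df2$KEY,
                  VALUE = expanded$VALUE[match(df2$KEY, expanded$KEY)])
df3
```

The expanded table has one row per covered integer (14 rows for this data), so this approach is only practical when the ranges are reasonably narrow.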

A very good and well-thought-out solution from @AntioniosK.

Here's a base R solution implemented as a general lookup function that takes as arguments a key dataframe and a bucket dataframe, both defined as in the question. The lookup values need not be unique or contiguous in this example, taking account of @Michael's comment that values may occur in more than one row (though normally such lookups would use unique ranges).

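lookup <- function(keydf, bucketdf) {
  keydf$rowid <- seq_len(nrow(keydf))                     # remember the original key order
  res <- merge(bucketdf, keydf)                           # no shared columns: Cartesian join
  res <- res[res$KEY >= res$MIN & res$KEY <= res$MAX, ]   # keep keys that lie inside a bucket
  res <- merge(res, keydf, all.y = TRUE)                  # restore keys with no matching bucket
  res[order(res$rowid), c("rowid", "KEY", "VALUE")]
}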

The first merge is a Cartesian join of all rows in the key to all rows in the bucket list. Such joins can be inefficient if the number of rows in the real tables is large, since joining x rows in the key to y rows in the bucket produces x*y rows. I doubt this would be a problem in this case unless x or y run into thousands of rows.

The second merge is done to recover any key values which are not matched to rows in the bucket list.
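
The two merges and the filter between them can be traced by hand. The following is a small sketch using the buckets from the question and an illustrative key set (the key 50 and the intermediate names `crossed`, `matched` and `restored` are assumptions made for this walk-through):

```r
# Buckets from the extended example in the question
df1 <- data.frame(MIN = c(1, 4, 8, 14), MAX = c(3, 6, 10, 18), VALUE = c(3, 56, 3, 5))
keys <- data.frame(KEY = c(2, 5, 50))   # 50 is an illustrative key outside every bucket
keys$rowid <- seq_len(nrow(keys))

# Step 1: no shared column names, so merge() returns the Cartesian product
crossed <- merge(df1, keys)
nrow(crossed)                           # 4 buckets * 3 keys = 12 rows

# Step 2: keep only the pairs where the key lies inside the bucket
matched <- crossed[crossed$KEY >= crossed$MIN & crossed$KEY <= crossed$MAX, ]
nrow(matched)                           # 2 rows: key 50 has dropped out

# Step 3: all.y = TRUE brings the unmatched key back, with NA in the bucket columns
restored <- merge(matched, keys, all.y = TRUE)
restored[order(restored$rowid), c("rowid", "KEY", "VALUE")]
```

Without the second merge, the key 50 would simply be missing from the result rather than appearing with an NA value.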

Using the example data as listed in @AntioniosK's post:

> lookup(df2, df1)
  rowid KEY VALUE
2     1   2     3
4     2   5    56
5     3   9     3
1     4  18     5
3     5   3     3

Using key and bucket exemplars that test edge cases (where the key equals the min or the max), a key value that is not in the bucket list (the value 50 in df2A), and a non-unique range (row 6 of df4 below):

df4 <- data.frame(MIN = c(1,4,8, 20, 30, 22), MAX = c(3, 6, 10, 25, 40, 24), VALUE = c(3, 56, 8, 10, 12, 23))
df2A <- data.frame(KEY = c(3, 6, 22, 30, 50))

> df4
  MIN MAX VALUE
1   1   3     3
2   4   6    56
3   8  10     8
4  20  25    10
5  30  40    12
6  22  24    23

> df2A
  KEY
1   3
2   6
3  22
4  30
5  50

> lookup(df2A, df4)
  rowid KEY VALUE
1     1   3     3
2     2   6    56
3     3  22    10
4     3  22    23
5     4  30    12
6     5  50    NA

As shown above, the lookup in this case returns two values for the key value 22, which matches two overlapping ranges, and NA for values that are in the key but not in the bucket list.
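
If the ranges are known not to overlap, the Cartesian join can be avoided altogether. The sketch below is one possible alternative, not part of the answer above: it sorts the buckets by MIN and uses base R's `findInterval` to locate the candidate bucket for each key. The function name `lookup_sorted` is hypothetical, and with overlapping ranges such as row 6 of df4 it would return at most one value per key.

```r
# Sketch: bucket lookup without a Cartesian join.
# Assumes the buckets do not overlap; lookup_sorted is a hypothetical name.
lookup_sorted <- function(keydf, bucketdf) {
  b <- bucketdf[order(bucketdf$MIN), ]           # sort buckets by lower bound
  idx <- findInterval(keydf$KEY, b$MIN)          # last bucket with MIN <= KEY (0 if none)
  idx[idx == 0] <- NA                            # key lies below every bucket
  hit <- !is.na(idx) & keydf$KEY <= b$MAX[idx]   # key must also be <= that bucket's MAX
  data.frame(KEY = keydf$KEY, VALUE = ifelse(hit, b$VALUE[idx], NA))
}

df1 <- data.frame(MIN = c(1, 4, 8, 14), MAX = c(3, 6, 10, 18), VALUE = c(3, 56, 3, 5))
df2 <- data.frame(KEY = c(2, 5, 9, 18, 3))
res <- lookup_sorted(df2, df1)
res

# Keys in a gap (7), below all buckets (0) or above all buckets (50) give NA
gaps <- lookup_sorted(data.frame(KEY = c(7, 0, 50)), df1)
gaps
```

Because `findInterval` works on a sorted vector, the cost grows with the number of keys and buckets separately rather than with their product.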
