How to join data.tables when one is a lookup table?

I'm having trouble applying a simple data.table join example to a larger (10GB) data set. merge() works just fine on data.frames with the larger dataset, although I'd love to take advantage of the speed in data.table. Could anyone point out what I'm misunderstanding about data.table (and the error message in particular)?

Here is the simple example (derived from this thread: Join of two data.tables fails).

library(data.table)

# The data of interest.
(DT <- data.table(id    = c(rep(1154:1155, 2), 1160),
                  price = c(1.99, 2.50, 15.63, 15.00, 0.75),
                  key   = "id"))

     id price
1: 1154  1.99
2: 1154 15.63
3: 1155  2.50
4: 1155 15.00
5: 1160  0.75

# Lookup table.
(lookup <- data.table(id      = 1153:1160,
                      version = c(1,1,3,4,2,1,1,2),
                      yr      = rep(2006, 8),
                      key     = "id"))

     id version   yr
1: 1153       1 2006
2: 1154       1 2006
3: 1155       3 2006
4: 1156       4 2006
5: 1157       2 2006
6: 1158       1 2006
7: 1159       1 2006
8: 1160       2 2006

# The desired table.  Note: lookup[DT] works as well (with a different column order).
DT[lookup, allow.cartesian = TRUE, nomatch = 0]

     id price version   yr
1: 1154  1.99       1 2006
2: 1154 15.63       1 2006
3: 1155  2.50       3 2006
4: 1155 15.00       3 2006
5: 1160  0.75       2 2006

The larger data set consists of two data.frames: temp.3561 (the dataset of interest) and temp.versions (the lookup dataset). They have the same structure as DT and lookup (above), respectively. Using merge() works well; however, my application of data.table is clearly flawed:

# Merge data.frames: works just fine
long.merged         <- merge(temp.versions, temp.3561, by = "id")

# Convert the data.frames to data.tables
DTtemp.3561         <- as.data.table(temp.3561)
DTtemp.versions     <- as.data.table(temp.versions)

# Merge the data.tables: doesn't work
setkey(DTtemp.3561, id)
setkey(DTtemp.versions, id)
DTlong.merged       <- merge(DTtemp.versions, DTtemp.3561, by = "id")

Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x),  : 
  Join results in 11277332 rows; more than 7946667 = max(nrow(x),nrow(i)). Check for duplicate 
key values in i, each of which join to the same group in x over and over again. If that's ok, 
try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the 
large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. 
Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-
help for advice.

DTtemp.versions has the same structure as lookup (in the simple example), and the key "id" consists of 779,473 unique values (no duplicates).

DTtemp.3561 has the same structure as DT (in the simple example) plus a few other variables, but its key "id" only has 829 unique values despite the 7,946,667 observations (lots of duplicates).

Since I'm just trying to add version numbers and years from DTtemp.versions to each observation in DTtemp.3561, the merged data.table should have the same number of observations as DTtemp.3561 (7,946,667). Specifically, I don't understand why merge() generates "excess" observations when using data.table but not when using data.frame.
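If the goal is only to attach version and yr to each row of the data of interest, one option (not the approach used above) is an update join, which adds the lookup columns by reference and so cannot change the row count. A minimal sketch on toy tables shaped like DT and lookup, assuming a data.table version that supports the on= argument:

```r
library(data.table)

# Toy tables with the same shape as DT and lookup in the simple example.
DT <- data.table(id    = c(1154L, 1155L, 1154L, 1155L, 1160L),
                 price = c(1.99, 2.50, 15.63, 15.00, 0.75))
lookup <- data.table(id      = 1153:1160,
                     version = c(1, 1, 3, 4, 2, 1, 1, 2),
                     yr      = 2006)

rows_before <- nrow(DT)

# For each lookup row, write its version and yr into the DT rows with the same id.
DT[lookup, on = "id", c("version", "yr") := .(i.version, i.yr)]

rows_after <- nrow(DT)
print(DT)
```

If the lookup table contained duplicate ids, an update join would silently keep the last matching row, so it is still worth checking the key for duplicates first.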

Likewise:

# Same error message, but with 12,055,777 observations
altDTlong.merged   <- DTtemp.3561[DTtemp.versions]

# Same error message, but with 11,277,332 observations
alt2DTlong.merged  <- DTtemp.versions[DTtemp.3561]

Including allow.cartesian=T and nomatch=0 doesn't drop the "excess" observations.
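For reference, here is a small sketch of what those two arguments do. The lookup_dup table is hypothetical: it has one duplicated id and one id that is absent from DT, which is an assumption made for illustration rather than a description of the real data.

```r
library(data.table)

DT <- data.table(id    = c(1154L, 1154L, 1155L, 1155L, 1160L),
                 price = c(1.99, 15.63, 2.50, 15.00, 0.75))

# Hypothetical lookup: id 1155 is duplicated and id 1153 has no match in DT.
lookup_dup <- data.table(id      = c(1153L, 1154L, 1155L, 1155L, 1160L),
                         version = c(1, 1, 3, 4, 2))

# allow.cartesian = TRUE only permits a result larger than the inputs.
outer_rows <- nrow(DT[lookup_dup, on = "id", allow.cartesian = TRUE])

# nomatch = 0 additionally drops lookup rows with no match in DT (id 1153).
inner_rows <- nrow(DT[lookup_dup, on = "id", allow.cartesian = TRUE, nomatch = 0])

print(c(nrow(DT), outer_rows, inner_rows))
```

Both results have more rows than DT, because each DT row with id 1155 matches two lookup rows. Neither argument removes those repeated matches.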

Oddly, if I truncate the dataset of interest to 10 observations, merge() works fine on both data.frames and data.tables.

# Merge short DF: works just fine
short.3561         <- temp.3561[-(11:7946667),]
short.merged       <- merge(temp.versions, short.3561, by = "id")

# Merge short DT
DTshort.3561       <- data.table(short.3561, key = "id")
DTshort.merged     <- merge(DTtemp.versions, DTshort.3561, by = "id")

I've been through the FAQ (http://datatable.r-forge.r-project.org/datatable-faq.pdf, and 1.12 in particular). How would you suggest thinking about this?

Could anyone point out what I'm misunderstanding about data.table (and the error message in particular)?

Answering your question directly: the error message

Join results in 11277332 rows; more than 7946667 = max(nrow(x),nrow(i)). Check for duplicate key values in i...

states that the join returns more rows than a typical join is expected to produce, that is, more rows than the larger of the two inputs. This means the lookup table's key contains duplicates, so some rows match more than once in the join.
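One way to check for this is to count rows per key value in the lookup table. The sketch below uses a hypothetical lookup_dup table with one duplicated id; keeping the first row per id is only an assumption about how the duplicates should be resolved.

```r
library(data.table)

DT <- data.table(id    = c(1154L, 1154L, 1155L, 1155L, 1160L),
                 price = c(1.99, 15.63, 2.50, 15.00, 0.75))

# Hypothetical lookup table in which id 1155 was entered twice.
lookup_dup <- data.table(id      = c(1153L, 1154L, 1155L, 1155L, 1160L),
                         version = c(1, 1, 3, 4, 2))

# Count rows per id and keep the ids that occur more than once.
dup_ids <- lookup_dup[, .N, by = id][N > 1]
print(dup_ids)

# Keep the first row per id (an assumption; choose whichever rule fits the data).
lookup_unique <- unique(lookup_dup, by = "id")

# With a unique key, each row of DT matches at most one lookup row.
joined <- lookup_unique[DT, on = "id"]
print(joined)
```

Once the lookup key is unique, the joined table has exactly as many rows as DT.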

If that doesn't answer your question, please restate it.
