R: intersecting data.frames on multiple criteria

Question

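I am trying to populate a binary vector based on the intersection of two data.frames on multiple criteria.

I have the code working, but I feel it uses too much memory just to produce a binary vector.

When I apply my code to my full data (40+ million rows), I start to run into memory problems.

Is there a simpler way to produce the vector?

Here is some sample data (the subsample is meant to include only observations that are in the full sample):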

ob1_1 <- as.data.frame(cbind(c(1999),c("111","222","666","777")),stringsAsFactors=FALSE)
ob2_1 <- as.data.frame(cbind(c(2000),c("111","333","555","777")),stringsAsFactors=FALSE)
ob3_1 <- as.data.frame(cbind(c(2001),c("111","222","333","777")),stringsAsFactors=FALSE)
ob4_1 <- as.data.frame(cbind(c(2002),c("111","444","555","777")),stringsAsFactors=FALSE)

full_sample <-  rbind(ob1_1,ob2_1,ob3_1,ob4_1)
colnames(full_sample) <- c("yr","ID")

ob1_2 <- as.data.frame(cbind(c(1999),c("111","222","777")),stringsAsFactors=FALSE)
ob2_2 <- as.data.frame(cbind(c(2000),c("333")),stringsAsFactors=FALSE)
ob3_2 <- as.data.frame(cbind(c(2001),c("888")),stringsAsFactors=FALSE)
ob4_2 <- as.data.frame(cbind(c(2002),c("111","444","555","777")),stringsAsFactors=FALSE)

sub_sample <-  rbind(ob1_2,ob2_2,ob3_2,ob4_2)
colnames(sub_sample) <- c("yr","ID")

Here is my working code:

library(sqldf)  # provides sqldf()

q_intersect <- ""
q_intersect <- paste(q_intersect , "select       a.yr, a.ID       ", sep=" ")
q_intersect <- paste(q_intersect , "from         full_sample a  ", sep=" ")
q_intersect <- paste(q_intersect , "intersect                     ", sep=" ")
q_intersect <- paste(q_intersect , "select       b.yr, b.ID       ", sep=" ")
q_intersect <- paste(q_intersect , "from         sub_sample b  ", sep=" ")
# trimws() is base R; trim() requires an extra package such as gdata
q_intersect <- trimws(gsub(" {2,}", " ", q_intersect))

intersect_temp <- cbind(sqldf(q_intersect ),1)
colnames(intersect_temp ) <- c("yr","ID","in_both")

q_expand <- ""
q_expand <- paste(q_expand , "select       in_both            ", sep=" ")
q_expand <- paste(q_expand , "from         full_sample a      ", sep=" ")
q_expand <- paste(q_expand , "left join    intersect_temp  b  ", sep=" ")
q_expand <- paste(q_expand , "on           a.yr=b.yr          ", sep=" ")
q_expand <- paste(q_expand , "and          a.ID=b.ID          ", sep=" ")
# trimws() is base R; trim() requires an extra package such as gdata
q_expand <- trimws(gsub(" {2,}", " ", q_expand))

solution <- as.integer(sqldf(q_expand)[,1])
solution [is.na(solution )] <- 0

Thanks in advance for your help!

Answer 1

It's not entirely clear what you're trying to accomplish, but I believe something like this would be much simpler.

library(data.table)
fullDT <- data.table(full_sample, key=c("yr", "ID"))
subDT  <- data.table(sub_sample,  key=c("yr", "ID"))

fullDT[ , intersect := 0L]
fullDT[subDT, intersect := 1L, nomatch=0]

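The idea is to set the key of each data.table to the columns you want to intersect on. When you call fullDT[subDT, nomatch=0] you get an inner join, and we set only those matched rows to 1; rows not identified by the inner join remain 0, as set in the previous line.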

fullDT
#        yr  ID intersect
#   1: 1999 111         1
#   2: 1999 222         1
#   3: 1999 666         0
#   4: 1999 777         1
#   5: 2000 111         0
#   6: 2000 333         1
#   7: 2000 555         0
#   8: 2000 777         0
#   9: 2001 111         0
#  10: 2001 222         0
#  11: 2001 333         0
#  12: 2001 777         0
#  13: 2002 111         1
#  14: 2002 444         1
#  15: 2002 555         1
#  16: 2002 777         1
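Setting a key sorts the table by the key columns. The sample data happens to be sorted already, but if the vector must line up with the original row order of full_sample, a variant worth considering is an unkeyed update join with the on= argument. This is only a sketch: the column name in_both is borrowed from the question, and the sample data is rebuilt inline so the block runs on its own.

```r
library(data.table)

# Sample data from the question, written out directly (yr is character there too).
full_sample <- data.frame(
  yr = rep(c("1999", "2000", "2001", "2002"), each = 4),
  ID = c("111", "222", "666", "777",
         "111", "333", "555", "777",
         "111", "222", "333", "777",
         "111", "444", "555", "777"),
  stringsAsFactors = FALSE
)
sub_sample <- data.frame(
  yr = c("1999", "1999", "1999", "2000", "2001",
         "2002", "2002", "2002", "2002"),
  ID = c("111", "222", "777", "333", "888",
         "111", "444", "555", "777"),
  stringsAsFactors = FALSE
)

# No key is set, so the rows of fullDT keep the order of full_sample.
fullDT <- as.data.table(full_sample)
subDT  <- as.data.table(sub_sample)

fullDT[, in_both := 0L]                            # default: no match
fullDT[subDT, on = c("yr", "ID"), in_both := 1L]   # flag rows that match subDT

solution <- fullDT$in_both
```

Both assignments use := and so modify fullDT by reference rather than copying it, which is the main reason this approach tends to be light on memory.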

Answer 2

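Simpler SQL. I gather that you wish to create a one-column data frame with the same number of rows as full_sample, such that a given row of the output contains 1 if the corresponding row of full_sample has a matching sub_sample row, and 0 otherwise.

In that case, the multiple SQL statements can be condensed into a single, simpler SQL statement as follows. The left join ensures that all rows of full_sample are included, and the natural join causes the join to occur on all column names common to the two input data frames.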

sqldf("select s.yr is not null as solution 
       from full_sample f natural left join sub_sample s")

(As an aside, note that character string literals can flow across multiple lines, so it is not necessary to paste multiple lines together.)
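For comparison, the membership test that the natural left join performs can also be sketched in base R without a database. This is not part of the original answer: it pastes the two columns into a single composite key and assumes the separator "|" never occurs in yr or ID.

```r
# Sample data from the question, written out directly.
full_sample <- data.frame(
  yr = rep(c("1999", "2000", "2001", "2002"), each = 4),
  ID = c("111", "222", "666", "777",
         "111", "333", "555", "777",
         "111", "222", "333", "777",
         "111", "444", "555", "777"),
  stringsAsFactors = FALSE
)
sub_sample <- data.frame(
  yr = c("1999", "1999", "1999", "2000", "2001",
         "2002", "2002", "2002", "2002"),
  ID = c("111", "222", "777", "333", "888",
         "111", "444", "555", "777"),
  stringsAsFactors = FALSE
)

# One composite key per row; "|" is assumed absent from both columns.
key_full <- paste(full_sample$yr, full_sample$ID, sep = "|")
key_sub  <- paste(sub_sample$yr,  sub_sample$ID,  sep = "|")

# 1 where the (yr, ID) pair of full_sample also appears in sub_sample, else 0.
solution <- as.integer(key_full %in% key_sub)
solution
# [1] 1 1 0 1 0 1 0 0 0 0 0 0 1 1 1 1
```

This builds two temporary character vectors, so its memory use on 40+ million rows should be measured rather than assumed.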

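Out-of-memory database. sqldf uses an in-memory database by default, but via the dbname= argument you can specify the name of a file (which need not exist in advance) to use as an out-of-memory, on-disk database. In that case you won't be limited by memory.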

sqldf("select s.yr is not null as solution 
       from full_sample f natural left join sub_sample s", dbname = "mydb")
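A variant of the same call, as a sketch: passing dbname = tempfile() places the database file in R's session temporary directory instead of the working directory, and indexing the result with $solution yields the vector the question asks for. The sample data is rebuilt inline so the block is self-contained, and the as.integer() wrapper is only a precaution about the returned column type.

```r
library(sqldf)

# Sample data from the question, written out directly.
full_sample <- data.frame(
  yr = rep(c("1999", "2000", "2001", "2002"), each = 4),
  ID = c("111", "222", "666", "777",
         "111", "333", "555", "777",
         "111", "222", "333", "777",
         "111", "444", "555", "777"),
  stringsAsFactors = FALSE
)
sub_sample <- data.frame(
  yr = c("1999", "1999", "1999", "2000", "2001",
         "2002", "2002", "2002", "2002"),
  ID = c("111", "222", "777", "333", "888",
         "111", "444", "555", "777"),
  stringsAsFactors = FALSE
)

# Same query as above, run against an on-disk database in a temporary file.
res <- sqldf("select s.yr is not null as solution
              from full_sample f natural left join sub_sample s",
             dbname = tempfile())

# One-column data frame -> plain 0/1 integer vector.
solution <- as.integer(res$solution)
```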

(In some cases you can improve performance by using indexes. See the sqldf home page for examples.)

Update: added the simpler SQL solution.

Answer 1: score 4, accepted, 2013-03-24 05:03:01

Answer 2: score 2, 2013-03-24 05:31:11
