How to prevent rows with duplicated indices/keys from being appended to a data.frame?

Question

I have data in which the combination of two variables ("ManufactererId" and "ProductId") forms a unique key/identifier. The data looks like this:

my.data <- data.frame(ManufactererId = c(1, 1, 2, 2),
                      ProductId = c(1, 2, 1, 7),
                      Price = c(12.99, 149.00, 0.99, 3.99))
my.data
#   ManufactererId ProductId  Price
# 1              1         1  12.99
# 2              1         2 149.00
# 3              2         1   0.99
# 4              2         7   3.99

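I want to make sure that I cannot accidentally add another row whose ManufactererId - ProductId combination equals one already present in the table (like a unique constraint on a database table).

That is, if I try to add a row with ManufactererId = 2 and ProductId = 7 to my data frame: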

my.data <- rbind(my.data, data.frame(ManufactererId = 2, ProductId = 7, Price = 120.00))

...it should fail with an error. How can this be achieved?

Or should I be using a different data type?

Answer 1

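1) zoo. Whether zoo is convenient depends on which operations you want to perform, but zoo objects have a unique index. We can construct a text index by pasting the two Id columns together.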

library(zoo)
z <- with(my.data, zoo(Price, paste(ManufactererId, ProductId)))

z <- c(z, zoo(90, "1 1")) # Error, not appended
z <- c(z, zoo(90, "1 3")) # OK

Note that the data part of a zoo object can be a vector, as shown above, or a matrix in case you have more than just Price in your data.

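2) SQLite. This can be done with any of a number of databases, but we will use SQLite here. First we create a table with a unique index in an SQLite database, and then we insert rows.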

library(DBI)
library(RSQLite)

con <- dbConnect(SQLite())
dbWriteTable(con, "my", my.data, row.names = FALSE)
# dbExecute() is meant for statements that return no result set
dbExecute(con, "create unique index ix on my(ManufactererId, ProductId)")

dbExecute(con, sprintf("insert into my values(%d, %d, %d)", 1, 1, 99)) # error
dbExecute(con, sprintf("insert into my values(%d, %d, %d)", 1, 13, 90)) # OK
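Because a failed insert raises an R error, a script that loads many rows may want to catch it and carry on. The following is only a sketch under a few assumptions: the helper name insert_row is hypothetical, an in-memory database is used so that nothing is written to disk, and the values are bound through the params argument instead of being pasted into the SQL with sprintf.

```r
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con, "my",
             data.frame(ManufactererId = c(1, 1, 2, 2),
                        ProductId = c(1, 2, 1, 7),
                        Price = c(12.99, 149.00, 0.99, 3.99)),
             row.names = FALSE)
dbExecute(con, "create unique index ix on my(ManufactererId, ProductId)")

# Hypothetical helper: returns TRUE if the row was inserted and FALSE if the
# database rejected it (for example because of the unique index).
insert_row <- function(con, manufacturer, product, price) {
  tryCatch({
    dbExecute(con, "insert into my values (?, ?, ?)",
              params = list(manufacturer, product, price))
    TRUE
  }, error = function(e) {
    message("Row rejected: ", conditionMessage(e))
    FALSE
  })
}

first  <- insert_row(con, 1, 1, 99)   # key (1, 1) already exists
second <- insert_row(con, 1, 13, 90)  # new key
n_rows <- dbGetQuery(con, "select count(*) as n from my")$n

dbDisconnect(con)
```

The unique index does the actual checking here; the R code only decides what to do with the error.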

Answer 2

You can do it like this, where keys holds the names of your unique key columns:

append_save <- function(DF, to_be_appended, keys=c("ManufactererId", "ProductId")){
  if(ncol(DF) != ncol(to_be_appended) || !all(names(DF) %in% names(to_be_appended))){
    stop("must have the same columns")
  }
  if(nrow(merge(DF, to_be_appended, by=keys))==0){
    rbind(DF, to_be_appended)
  } else {
    stop("Trying to append duplicated indices")
  }
}

Test it:

to_be_appended = data.frame(ManufactererId=2,ProductId=17,Price=3.99)
append_save(my.data, to_be_appended) # works
to_be_appended_err = data.frame(ManufactererId=2,ProductId=7,Price=3.99)
append_save(my.data, to_be_appended_err) # error
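The same check can also be written without merge. This is a minimal base R sketch, not the answer author's code: the helper name append_checked is hypothetical, and it uses anyDuplicated on the key columns of the combined data.

```r
# Hypothetical helper: refuse the append if any key combination would occur
# more than once in the combined data.
append_checked <- function(DF, new_rows, keys = c("ManufactererId", "ProductId")) {
  combined <- rbind(DF, new_rows)
  # anyDuplicated() on a data.frame compares whole rows of the selected columns
  if (anyDuplicated(combined[keys]) > 0) {
    stop("Trying to append duplicated indices")
  }
  combined
}

my.data <- data.frame(ManufactererId = c(1, 1, 2, 2),
                      ProductId = c(1, 2, 1, 7),
                      Price = c(12.99, 149.00, 0.99, 3.99))

ok <- append_checked(my.data,
                     data.frame(ManufactererId = 2, ProductId = 17, Price = 3.99))

# tryCatch turns the error into a value so that a script can continue
res <- tryCatch(
  append_checked(my.data,
                 data.frame(ManufactererId = 2, ProductId = 7, Price = 120.00)),
  error = function(e) conditionMessage(e)
)
```

Unlike the merge-based check, this version also fails when the new rows contain the same key twice among themselves, or when DF already holds duplicates.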

If you want to append data based on the key columns alone (the other columns may differ), you can use data.table as follows:

append_save <- function(DF, to_be_appended, keys=c("ManufactererId", "ProductId")){
  if(!all(keys %in% names(to_be_appended))){
    stop("key-columns must be present")
  }
  # merge() is a base generic (not exported by data.table) and takes `by`, not `on`
  if(nrow(merge(DF, to_be_appended, by=keys))==0){
    # fill = TRUE pads columns missing from to_be_appended with NA
    data.table::setDF(data.table::rbindlist(list(DF, to_be_appended), fill = TRUE))[]
  } else {
    stop("Trying to append duplicated indices")
  }
}

Answer 3

One way to do this in base R is to use an environment as a dictionary or hash-map-like object.

my.dict <- new.env()

First, write some helper functions:

make_key <- function(ManufactererId, ProductId)
  paste(ManufactererId, ProductId)

set_value <- function(key, value, dict){
         ## checking here assures desired behavior 
         if(any(key %in% names(dict)))
            stop("This key has been used")
         assign(key, value,  envir=dict)
}

Then you can generate the keys like this:

keys <- make_key(my.data[[1]], my.data[[2]])

Setting the values requires a bit more care:

# don't just do this as the first element is used by assign
# set_value(keys, my.data[[3]], dict=my.dict)

mapply(set_value, keys, my.data[[3]], MoreArgs = list(dict=my.dict))
ls.str(my.dict) # better than str for environments
# 1 1 :  num 13
# 1 2 :  num 149
# 2 1 :  num 0.99
# 2 7 :  num 3.99

set_value("1 1", 4, my.dict)
# Error in set_value("1 1", 4, my.dict) : This key has been used
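Once values are stored this way, they can be looked up by key again. The sketch below is an illustration only: has_key is a hypothetical helper, and the environment is filled by hand so that the block runs on its own.

```r
my.dict <- new.env()
assign("1 1", 12.99, envir = my.dict)
assign("2 7", 3.99, envir = my.dict)

# Hypothetical helper. inherits = FALSE restricts the lookup to my.dict itself,
# so objects in enclosing environments are not mistaken for keys.
has_key <- function(key, dict) exists(key, envir = dict, inherits = FALSE)

found   <- has_key("2 7", my.dict)
missing <- has_key("9 9", my.dict)

# get() fetches one value; mget() fetches several at once as a named list
one_price <- get("1 1", envir = my.dict)
prices    <- unlist(mget(c("1 1", "2 7"), envir = my.dict))
```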

Answer 4

A simple way to rbind new data while excluding duplicates:

library(data.table)
my.data = data.table(ManufactererId = c(1, 1, 2, 2),
                     ProductId = c(1, 2, 1, 7),
                     Price = c(12.99, 149.00, 0.99, 3.99),
                     key = c("ManufactererId","ProductId"))
x = my.data # my data will be called 'x'
y = data.table(ManufactererId = 2, ProductId = 7, Price = 120.00)
rbind(x, y[!x, on=key(x)])
#   ManufactererId ProductId  Price
#1:              1         1  12.99
#2:              1         2 149.00
#3:              2         1   0.99
#4:              2         7   3.99

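You do not need to set a key, though; you can supply a character vector directly to the on argument. I think using a key is worthwhile, because it reflects our business expectations of the data structure.

If you want to raise an error in that case, you can use the following: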

unique.rbind = function(x, y, by=key(x)) {
    if (nrow(x[y, nomatch=0L, on=by])) stop("duplicates in 'y'")
    rbind(x, y)
}
unique.rbind(x, y)
# Error in unique.rbind(x, y) : duplicates in 'y'

If the error is raised, none of the rows of y are inserted.
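As a small illustration of the remark about the on argument, the anti-join can be run on tables that have no key set. The variable names by_cols, new_rows and n_skipped are made up for this sketch, and y is given a second, non-clashing row so that the effect is visible.

```r
library(data.table)

# No key is set on either table
x <- data.table(ManufactererId = c(1, 1, 2, 2),
                ProductId = c(1, 2, 1, 7),
                Price = c(12.99, 149.00, 0.99, 3.99))
y <- data.table(ManufactererId = c(2, 2),
                ProductId = c(7, 8),
                Price = c(120.00, 5.50))

by_cols <- c("ManufactererId", "ProductId")

# Anti-join: keep only the rows of y whose key combination is absent from x
new_rows  <- y[!x, on = by_cols]
n_skipped <- nrow(y) - nrow(new_rows)

combined <- rbind(x, new_rows)
```

Here the row with key (2, 7) is skipped silently and only the row with key (2, 8) is appended.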


Answer 1
7 2016-03-20 11:30:38

Answer 2
7 2016-03-20 11:34:29

Answer 3
1 2016-04-01 23:26:37

Answer 4
0 2016-04-01 23:46:17
