R, in data.table, select some rows

I have a strange situation: I am simply trying to select some rows from a data.table.

dput(DT)
structure(list(Date = structure(c(10959, 10960, 10961, 10962, 
10963, 10966, 10967, 10968, 10969, 10970, 10974, 10975, 10976, 
10977, 10980, 10981, 10982, 10983, 10984, 10987), class = "Date"), 
    A = c(51.502148, 47.567955, 44.61731, 42.918453, 46.494991, 
    49.311516, 48.640915, 47.657368, 48.372677, 48.909157, 51.144493, 
    50.071529, 48.730328, 49.177395, 48.998569, 48.417381, 48.864449, 
    48.953861, 48.685623, 47.344421), AA = c(96.840897, 97.561798, 
    103.329002, 101.598839, 101.406601, 101.214363, 100.397339, 
    99.820618, 97.802101, 96.120003, 93.717003, 93.813118, 88.093979, 
    90.400864, 88.045921, 86.748299, 85.450684, 84.489479, 83.287979, 
    83.432159), AAC = c(NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_), AACG = c(NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_)), row.names = c(NA, -20L), class = c("data.table", 
"data.frame"), .internal.selfref = <pointer: 0x7fa9640148e0>, sorted = "Date")

library(data.table)
library(lubridate)  # provides duration()

StartDate <- as.Date("2000-01-05")
TestDates <- c(StartDate,
               StartDate + duration(6, units = "day"),
               StartDate + duration(2, units = "week"))

DT[Date %in% TestDates, ]  # works well here.

The real "DT" has 20 million rows. Running the same code on it, R reported:

Empty data.table (0 rows and 7347 cols)

Does anyone know how to pick rows using a vector, in a more reliable way?

I found the problem. In this line of code:

StartDate <- as.Date("2000-01-05")

I set the base date and then used the following code to derive the other dates:

TestDates <- c(StartDate,
               StartDate + duration(6, units = "day"),
               StartDate + duration(2, units = "week"))

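But duration is the wrong tool here: a lubridate duration is a fixed number of seconds, not a number of calendar days or weeks. Instead, I need periods: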

TestDates <- c(StartDate,
               StartDate + days(6),
               StartDate + weeks(2))
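
One way to catch this kind of mismatch early is to check the class of the lookup vector before filtering. The following is only a minimal sketch on a made-up table (`toy` and its values are hypothetical, not the real data); it assumes the data.table and lubridate packages are installed.

```r
library(data.table)
library(lubridate)

# Hypothetical stand-in for DT: 21 consecutive dates starting 2000-01-03
toy <- data.table(
  Date = as.Date("2000-01-03") + 0:20,
  A    = seq_len(21)
)

StartDate <- as.Date("2000-01-05")
TestDates <- c(StartDate,
               StartDate + days(6),
               StartDate + weeks(2))

# Stop early if the lookup vector is not a Date vector,
# since %in% would then compare values of different types
stopifnot(inherits(TestDates, "Date"))

# Select the rows whose Date appears in TestDates
picked <- toy[Date %in% TestDates]
print(picked)
```

If the `stopifnot()` check fails, the date arithmetic is the place to look, not the data.table filter.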

In my case, I need to get data from different years, for example, 2000-01-01 and 2020-01-01. Periods such as seconds, minutes, hours, days, weeks, months and years work in calendar ("human") units, so I do not need to worry about leap years. For example:

StartDate <- ymd("2020-01-01") # note, 2020 is a leap year

StartDate + duration(1, units = "year")
>[1] "2020-12-31 06:00:00 UTC"

StartDate + years(1)
>[1] "2021-01-01"
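
The 06:00:00 in the first result can be reproduced by hand. The sketch below uses base R only and assumes, consistent with the output above, that a one-year duration equals 365.25 days expressed in seconds; the variable names are made up for illustration.

```r
# One "year" as a fixed span: 365.25 days in seconds
secs_per_year <- 365.25 * 24 * 60 * 60

start <- as.POSIXct("2020-01-01", tz = "UTC")

# Adding a fixed number of seconds ignores the calendar:
# 2020 has 366 days, so this lands 18 hours before 2021-01-01
fixed_step <- start + secs_per_year
print(format(fixed_step, "%Y-%m-%d %H:%M:%S", tz = "UTC"))

# Stepping by calendar year instead (base R seq() on a Date)
calendar_step <- seq(as.Date("2020-01-01"), by = "year", length.out = 2)[2]
print(calendar_step)
```

The calendar step matches what `years(1)` returns above, while the fixed step matches the `duration()` result.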
