简体   繁体   中英

Data table subset using a condition, and only specific rows from that subset

I have a big data table that is rendered in a plotly plot, and the user can select points in the scatter plot with a region (rectangular or polygon). After processing this select element, I get a dataframe with a list of the 'cluster' each point in the selection belongs to and which 'nth' element (row) of that cluster it is, ie

cluster    elementNr    
1             1    
1             23    
2             7
2             9
2             14

This recreates the output:

selectme <- data.frame(cluster = c(rep(1,20), rep(2,30)), elementNr = c(sample(1:100, 20, replace = F),sample(1:100,30, replace =F)))

And this creates the dummy equivalent of my main data table

mydata <- data.table(cluster = sample(c(1:5), 5000, replace = T), val1 = runif(5000, 0.1, 100), val2 = runif(5000, 0.1, 50))

What I am trying to figure out, is how to subset the data table with the selectme dataframe so that I get the right particles out of there. ie as the first example, I should end up with the 1st and 23rd row of the rows where cluster == 1, and the 7th, 9th, 14th row of those where cluster == 2

Expected output:

Cluster   val1    val2
 1        0.45    2.54     (1st of this cluster 
 1        25.73   84.83  (23rd element of this cluster 
 2        0.26    12.76   (7th el)
 2        3.16    26.84   (9th el)
 2        0.76    19.46   (14th el)
  • values are just examples since the test code creates them at random

So 7th element is not the 7th row of mydata, but the 7th row of the subset where cluster ==2 And because my data can contain dozens of clusters I would like to put the following logical sentence into code: "For each cluster in selectme, find the rows of the subset of mydata where cluster equals the same cluster and grab the rows that equal the elementnrs of selectme for the current cluster, and produce one output containing all the corresponding rows"

I can transfer myselection to data.table first if needed of course

UPDATE: real data added

Real data file, real selectme file, and expected output file in RDS form are found here: datafiles

Using the data you provided:

mydata <- data.table(cluster = sample(c(1:5), 5000, replace = T), 
                     val1 = runif(5000, 0.1, 100), 
                     val2 = runif(5000, 0.1, 50))

selectme <- data.frame(cluster = c(rep(1,20), rep(2,30)), 
                       elementNr = c(sample(1:100, 20, replace = F),
                       sample(1:100,30, replace =F)))

# One at a time:
mydata[cluster == 1][selectme[selectme$cluster == 1,]$elementNr]
#    cluster       val1      val2
# 1:       1 92.7082853 34.879758
# 2:       1  6.2921762 19.203557
# 3:       1 32.5258127  3.731329
# 4:       1 93.5793112 34.937362
# 5:       1 62.0401740 42.491657
# and so on...

# Programmatically

clusters <- unique(selectme$cluster)
tobind <- vector("list", length(clusters))
for (i in seq_along(tobind)){
  tobind[[i]] <- mydata[cluster == clusters[i]][selectme[selectme$cluster == clusters[i],]$elementNr]
}

selected <- rbindlist(tobind)

After creating an index for each row, you can do an inner join using the data.table as follows

set.seed(8L)
selectme <- data.table(cluster = c(rep(1L,20), rep(2L,30)), elementNr = c(sample(1:100, 20, replace = F),sample(1:100,30, replace =F)))
mydata <- data.table(cluster = sample(c(1:5), 5000, replace = T), val1 = runif(5000, 0.1, 100), val2 = runif(5000, 0.1, 50))

selectme[mydata[, rn := .I, by=.(cluster)],
    on=c("cluster", "elementNr"="rn"), nomatch=0L]

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM