Subset of a data frame with the penultimate values of one of the columns

I have a data.frame with lots of columns; one of them holds the code of the sample area and another one holds the number of the sample. I want to subset just the rows from the penultimate sample in each sample area. I've tried many different things. This is my best guess, but it's still not working.

site <- sample (1:3, 10, replace= T)
d2 <- sample (1:5, 10, replace= T)
d3 <- sample (1:5, 10, replace= T)
samplet <- sample (1:4, 10, replace= T)
mydata <- data.frame (cbind(site, d2, d3, samplet))

penultimate <- matrix(NA,,) # here I don't know what the return will look like, as I don't know how the data frame will change
si <- matrix (NA, , )  
pl <- unique (site)
for (i in 1:(length (pl))) {
    si <-  mydata[which (samplet==pl[i]),] # I tried to create a temporary matrix, so I can calculate each site at a time
    penultimate <- si[which (si$samplet!=(max(si$samplet[si$samplet!=max(si$samplet)]))),]
}

Cheers!

A simple way is to use data.table and its built-in .N value (the number of rows in the current group):

# assuming `d1` is the column from which you want to find the penultimate

mydata <- data.frame(d1=strsplit("AAABBCCCCCDD", "")[[1]], d2=rnorm(12), d3=LETTERS[1:12], d4=c(101:103, 201:202, 301:305, 401:402))

library(data.table)
DT <- data.table(mydata)

DT[, .SD[.N-1], by=d1]

   d1         d2 d3  d4
1:  A  1.6906714  B 102
2:  B -0.1239458  D 201
3:  C -0.2976339  I 304
4:  D  0.6858120  K 401

Compare with mydata

> mydata
   d1         d2 d3  d4
1   A  0.5986002  A 101
2   A  1.6906714  B 102   <~~~~  \
3   A -0.3253657  C 103
4   B -0.1239458  D 201   <~~~~   -\
5   B  0.8261401  E 202
6   C  0.0601318  F 301             Penultimate Values by d1
7   C -0.9766622  G 302
8   C  0.1028259  H 303
9   C -0.2976339  I 304   <~~~~~  -/ 
10  C -1.1467000  J 305
11  D  0.6858120  K 401   <~~~~~  / 
12  D -0.6160335  L 402

Edit: updated with new sample data.

Here's a solution using tapply with @Ricardo's data:

# data (thanks @Ricardo)
set.seed(1234)
mydata <- data.frame(d1=strsplit("AAABBCCCCCDD", "")[[1]], 
             d2=rnorm(12), d3=LETTERS[1:12], 
             d4=c(101:103, 201:202, 301:305, 401:402))

# solution
idx <- unlist(tapply(seq_len(nrow(mydata)), mydata$d1, function(x) x[length(x)-1]))
mydata[idx, ]
#    d1         d2 d3  d4
# 2   A  0.2774292  B 102
# 4   B -2.3456977  D 201
# 9   C -0.5644520  I 304
# 11  D -0.4771927  K 401

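The unlist is required in case there's just 1 row for a particular value of d1. For such a group x[length(x)-1] returns integer(0), so tapply returns a list instead of a plain vector of indices.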
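To see why, here is a minimal sketch using a hypothetical toy data frame (not part of the original data) in which group "E" has a single row. It only illustrates the behaviour of tapply and unlist in that case:

```r
# Hypothetical toy data: group "E" has only one row
toy <- data.frame(d1 = c("A", "A", "A", "E"),
                  d4 = c(101L, 102L, 103L, 501L))

# For "E", x[length(x) - 1] is x[0], i.e. integer(0),
# so tapply() cannot simplify its result and returns a list
res <- tapply(seq_len(nrow(toy)), toy$d1, function(x) x[length(x) - 1])
is.list(res)

# unlist() drops the empty element and leaves plain row indices
idx <- unlist(res)
toy[idx, ]
```

With this toy data only the penultimate row of group "A" is returned; the single-row group "E" is silently dropped.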


What does the code do?

I'll explain as well as I can by breaking the function down. Looking at the line idx <- ..., the function tapply splits the sequence c(1, 2, ..., nrow(mydata)) (here, nrow(mydata) = 12) by the column mydata$d1. That is:

tapply(1:12, mydata$d1, c) # just to show what happens here
$A
[1] 1 2 3

$B
[1] 4 5

$C
[1]  6  7  8  9 10

$D
[1] 11 12 

Now, instead of the function c, we need the last-but-one element of each of these groups. So we create a function(x) x[length(x)-1], to which the index vectors for A, B, C and D are passed one by one; the code x[length(x)-1] selects the last-but-one element each time. This gives you the row indices of all penultimate rows. Then just subset the data.frame with mydata[idx, ].
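The steps above can also be wrapped into a small function. This is only a sketch: the name penultimate_rows and its arguments are made up for illustration, and it uses split plus lapply in place of tapply, which groups the row numbers in the same way.

```r
# Hypothetical helper: return the penultimate row of each group in `df`
penultimate_rows <- function(df, group_col) {
  # Row numbers split by group, e.g. list(A = 1:3, B = 4:5, ...)
  rows_by_group <- split(seq_len(nrow(df)), df[[group_col]])
  # Last-but-one row number per group; one-row groups give integer(0)
  idx <- unlist(lapply(rows_by_group, function(x) x[length(x) - 1]))
  df[idx, , drop = FALSE]
}

mydata <- data.frame(d1 = strsplit("AAABBCCCCCDD", "")[[1]],
                     d3 = LETTERS[1:12],
                     d4 = c(101:103, 201:202, 301:305, 401:402))

penultimate_rows(mydata, "d1")$d4
# [1] 102 201 304 401
```

The d4 values match rows 2, 4, 9 and 11, the same rows selected by the tapply version.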

In addition to the previous answers, it is also possible to do this with dplyr:

set.seed(1234)
mydata <- data.frame(d1=strsplit("AAABBCCCCCDD", "")[[1]], 
                 d2=rnorm(12), d3=LETTERS[1:12], 
                 d4=c(101:103, 201:202, 301:305, 401:402))

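library(dplyr)

# %>% is the pipe in current dplyr (early releases used %.%)
mydata %>%
  group_by(d1) %>%
  mutate(count = 1:n()) %>%                    # row position within each group
  filter(count %in% max(c(count-1,1))) %>%     # position n-1, or 1 if the group has one row
  select(-count)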

As in @BondedDust's answer, I assume you want to keep the solitary row if there is only one row for any given d1 "group".
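The same fallback can be sketched in base R. The toy data and the helper name pick below are hypothetical; the only idea shown is taking position max(n - 1, 1) within each group, so that a one-row group keeps its only row:

```r
# Hypothetical toy data: group "B" has only one row
toy2 <- data.frame(d1 = c("A", "A", "A", "B"),
                   d4 = c(101L, 102L, 103L, 201L))

# Position to keep in a group of size n: n - 1 if n >= 2, otherwise 1
pick <- function(x) x[max(length(x) - 1, 1)]

# Every group now yields exactly one index, so no unlist() is needed
idx2 <- as.vector(tapply(seq_len(nrow(toy2)), toy2$d1, pick))
toy2[idx2, "d4"]
# [1] 102 201
```

Whether a one-row group should be kept or dropped depends on what "penultimate" should mean for it, so treat this as one possible convention.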
