I have a data.frame with many columns; one of them holds the code of the sample area and another holds the number of the sample. I want to extract only the penultimate sample in each sample area. I've tried many different things. This is my best attempt so far, but it's still not working.
site <- sample (1:3, 10, replace= T)
d2 <- sample (1:5, 10, replace= T)
d3 <- sample (1:5, 10, replace= T)
samplet <- sample (1:4, 10, replace= T)
mydata <- data.frame (cbind(site, d2, d3, samplet))
penultimate <- matrix(NA,,) # here I dont know how the return will be, as I dont know how the dataframe will change
si <- matrix (NA, , )
pl <- unique (site)
for (i in 1:(length (pl))) {
si <- mydata[which (samplet==pl[i]),] # I tried to create a temporary matrix, so I can calculate each site at a time
penultimate <- si[which (si$samplet!=(max(si$samplet[si$samplet!=max(si$samplet)]))),]
}
Cheers!
A simple way is to use data.table and its built-in .N value:
# assuming `d1` is the column from which you want to find the penultimate
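library(data.table)
mydata <- data.frame(d1=strsplit("AAABBCCCCCDD", "")[[1]], d2=rnorm(12), d3=LETTERS[1:12], d4=c(101:103, 201:202, 301:305, 401:402))
DT <- data.table(mydata)
# .N is the number of rows in the current group, so .SD[.N-1] is that group's penultimate row
DT[, .SD[.N-1], by=d1]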
d1 d2 d3 d4
1: A 1.6906714 B 102
2: B -0.1239458 D 201
3: C -0.2976339 I 304
4: D 0.6858120 K 401
> mydata
d1 d2 d3 d4
1 A 0.5986002 A 101
2 A 1.6906714 B 102 <~~~~ \
3 A -0.3253657 C 103
4 B -0.1239458 D 201 <~~~~ -\
5 B 0.8261401 E 202
6 C 0.0601318 F 301 Penultimate Values by d1
7 C -0.9766622 G 302
8 C 0.1028259 H 303
9 C -0.2976339 I 304 <~~~~~ -/
10 C -1.1467000 J 305
11 D 0.6858120 K 401 <~~~~~ /
12 D -0.6160335 L 402
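One thing to watch for with the approach above: a group with a single row has no penultimate row, so .SD[.N-1] returns nothing for it and the group disappears from the result. Here is a minimal sketch of both behaviours. The toy table and the names strict and lenient are made up for illustration, and keeping the solitary row is only one possible choice.

```r
library(data.table)

# Toy data (hypothetical): group "E" has only one row
DT <- data.table(d1 = c("A", "A", "A", "B", "B", "E"),
                 d4 = c(101, 102, 103, 201, 202, 501))

# .SD[.N - 1] selects zero rows when .N == 1, so group "E" is dropped
strict <- DT[, .SD[.N - 1L], by = d1]

# Clamping the index at 1 keeps the solitary row instead
lenient <- DT[, .SD[max(.N - 1L, 1L)], by = d1]
```

Both versions take rows in their current order, so sort the table first if "penultimate" should follow a particular column.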
Edit: updated with new sample data.
Here's a solution using tapply with @Ricardo's data:
# data (thanks @Ricardo)
set.seed(1234)
mydata <- data.frame(d1=strsplit("AAABBCCCCCDD", "")[[1]],
d2=rnorm(12), d3=LETTERS[1:12],
d4=c(101:103, 201:202, 301:305, 401:402))
# solution
idx <- unlist(tapply(seq_len(nrow(mydata)), mydata$d1, function(x) x[length(x)-1]))
mydata[idx, ]
# d1 d2 d3 d4
# 2 A 0.2774292 B 102
# 4 B -2.3456977 D 201
# 9 C -0.5644520 I 304
# 11 D -0.4771927 K 401
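The unlist is required in case there's just one row for a particular value of d1. For such a group x[length(x)-1] returns integer(0), so tapply gives back a list instead of a vector, and unlist drops the empty entry.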
I'll explain as well as I can by breaking down the function. Looking at the line idx <- ..., the function tapply splits the sequence c(1, 2, ... nrow(mydata)) (here, nrow(mydata) = 12) by the column mydata$d1. That is:
tapply(1:12, mydata$d1, c) # just to show what happens here
$A
[1] 1 2 3
$B
[1] 4 5
$C
[1] 6 7 8 9 10
$D
[1] 11 12
Now, instead of the function c we need the last-but-one element of each of these groups. So we create a function(x) x[length(x)-1], to which each of the groups A, B, C, D is passed one by one, and the code x[length(x)-1] selects the last-but-one element each time. These give you the row indices of all penultimate rows. So, just subset the data.frame with mydata[idx, ].
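If you would rather keep the only row of a single-row group than drop it, the same idea can be written with split and vapply. This is just a sketch: penultimate_idx is a hypothetical helper, not part of base R, and the toy data is made up. It also assumes the rows are already in the order you care about within each group.

```r
# Hypothetical helper: row index of the penultimate row in each group,
# falling back to the only row when a group has a single row
penultimate_idx <- function(groups) {
  rows <- split(seq_along(groups), groups)
  vapply(rows, function(x) x[max(length(x) - 1L, 1L)], integer(1))
}

# Toy data (hypothetical): group "E" has only one row
toy <- data.frame(d1 = c("A", "A", "A", "B", "B", "E"),
                  d4 = c(101, 102, 103, 201, 202, 501))

idx <- penultimate_idx(toy$d1)
toy[idx, ]
```

Because every group now returns exactly one index, vapply can return a plain integer vector and no unlist is needed.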
In addition to the previous answers, it is also possible to do this with dplyr:
set.seed(1234)
mydata <- data.frame(d1=strsplit("AAABBCCCCCDD", "")[[1]],
d2=rnorm(12), d3=LETTERS[1:12],
d4=c(101:103, 201:202, 301:305, 401:402))
library(dplyr)
# %.% was dplyr's original chaining operator; current versions use %>%
mydata %>%
  group_by(d1) %>%
  mutate(count = 1:n()) %>%
  filter(count %in% max(c(count-1,1))) %>%
  select(-count)
As in @BondedDust's answer, I assume you want the solitary row if there is only one row for any given d1 "group".
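For what it's worth, the same selection can probably be written more compactly with slice, which picks rows by position within each group. This is a sketch against a made-up toy data frame, assuming a reasonably recent dplyr. The solitary-row fallback is the same assumption as above.

```r
library(dplyr)

# Toy data (hypothetical): group "E" has only one row
toy <- data.frame(d1 = c("A", "A", "A", "B", "B", "E"),
                  d4 = c(101, 102, 103, 201, 202, 501))

res <- toy %>%
  group_by(d1) %>%
  # n() is the group size; clamp at 1 so a single-row group keeps its row
  slice(max(n() - 1L, 1L)) %>%
  ungroup()
```

This avoids the temporary count column altogether.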