I have two dataframes.
df1 looks like this:
chr <- c("1","1","2")
pos <- c("1000","2000","2000")
df1=data.frame(cbind(chr,pos))
df1
chr pos
1 1000
1 2000
2 2000
df2 looks like this:
chr <- c("1","1","1","2","2")
start <- c("500","1500","2500","500","1500")
end <- c("1499","2499","3499","1499","2499")
state <- c("state1", "state2", "state1", "state3", "state4")
df2=data.frame(cbind(chr,start,end,state))
df2
chr start end state
1 500 1499 state1
1 1500 2499 state2
1 2500 3499 state1
2 500 1499 state3
2 1500 2499 state4
I would like to add a column state to the first dataframe, based on the value in column df1$chr being the same as df2$chr and the value in df1$pos being between those in df2$start and df2$end. The intended end result would look like this:
chr pos state
1 1000 state1
1 2000 state2
2 2000 state4
I know how to do this if the values in df2$start were the same as those in df1$pos, but it is the range that I am struggling with.
Any tips would be very useful.
Being a SQL-inclined person, I might go with a sqldf option here:
library(sqldf)
query <- "select df1.chr, df1.pos, df2.state
from df1
left join df2
on df1.chr = df2.chr and
df1.pos between df2.start and df2.end"
df1 <- sqldf(query, stringsAsFactors=FALSE)
Edit:
Your pos, start, and end columns should be numeric in my opinion, because you need to do comparisons involving numbers, not text. So cast them all to numeric and the above solution should work:
df1$pos <- as.numeric(df1$pos)
df2$start <- as.numeric(df2$start)
df2$end <- as.numeric(df2$end)
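To illustrate why the cast matters, here is a minimal sketch in plain R (not sqldf itself) using the first row of each example table. The variable names are made up for the illustration. Text comparison goes character by character, so "1000" is not treated as lying between "500" and "1499"; the same kind of problem can be expected when the join compares text columns.

```r
# Values taken from the example data, still stored as text.
pos_chr   <- "1000"
start_chr <- "500"
end_chr   <- "1499"

# Character comparison is lexicographic: "1" sorts before "5",
# so "1000" >= "500" is FALSE and the range test fails.
in_range_chr <- pos_chr >= start_chr & pos_chr <= end_chr

# After casting, the comparison is numeric and behaves as intended.
in_range_num <- as.numeric(pos_chr) >= as.numeric(start_chr) &
  as.numeric(pos_chr) <= as.numeric(end_chr)

print(c(text = in_range_chr, numeric = in_range_num))
```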
We could use a non-equi join with data.table. The inequalities are inclusive (>= and <=) so that a pos equal to start or end still matches:
library(data.table)
setDT(df1)[df2, state := i.state, on = .(chr, pos >= start, pos <= end)]
df1
# chr pos state
#1: 1 1000 state1
#2: 1 2000 state2
#3: 2 2000 state4
NOTE: when constructing a data.frame, avoid data.frame(cbind(...)) because cbind converts to a matrix, and a matrix can hold only a single class. Use data.frame directly. Another problem with the example data is using string variables for 'pos', 'start', and 'end'. They should be of numeric class:
chr <- c("1","1","2")
pos <- c(1000,2000,2000)
df1 <- data.frame(chr, pos)
chr <- c("1","1","1","2","2")
start <- c(500,1500,2500,500,1500)
end <- c(1499,2499,3499,1499,2499)
state <- c("state1", "state2", "state1", "state3", "state4")
df2 <- data.frame(chr, start, end, state)
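As a quick check of the point about cbind, the sketch below builds the first table both ways and inspects the class of pos. The names via_cbind and direct are only illustrative. Going through cbind should leave pos as non-numeric (character on recent R versions, factor on older ones with stringsAsFactors = TRUE), while passing the vectors straight to data.frame keeps it numeric.

```r
chr <- c("1", "1", "2")
pos <- c(1000, 2000, 2000)

# cbind() builds a character matrix first, so pos loses its numeric class.
via_cbind <- data.frame(cbind(chr, pos))

# Passing the vectors directly keeps each column's own class.
direct <- data.frame(chr, pos)

print(class(via_cbind$pos))
print(class(direct$pos))
```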