
How to fill a column using values from another dataframe in R

I have two dataframes.

df1 looks like this:

chr <- c("1","1","2")
pos <- c("1000","2000","2000")
df1=data.frame(cbind(chr,pos))
df1

chr    pos
1      1000
1      2000
2      2000

df2 looks like this:

chr <- c("1","1","1","2","2")
start <- c("500","1500","2500","500","1500")
end <- c("1499","2499","3499","1499","2499")
state <- c("state1", "state2", "state1", "state3", "state4")
df2=data.frame(cbind(chr,start,end,state))
df2

chr start  end  state
1   500    1499 state1
1   1500   2499 state2
1   2500   3499 state1
2   500    1499 state3
2   1500   2499 state4

I would like to add a state column to the first dataframe: for each row of df1, take the state from the df2 row where df2$chr equals df1$chr and df1$pos falls between df2$start and df2$end. The intended end result would look like this:

chr    pos     state
1      1000    state1
1      2000    state2
2      2000    state4

I know how to do this if the values in df2$start were the same as those in df1$pos, but it is the range that I am struggling with.

Any tips would be very useful.

Being a SQL-inclined person, I might go with a sqldf option here:

library(sqldf)
query <- "select df1.chr, df1.pos, df2.state
          from df1
          left join df2
              on df1.chr = df2.chr and
                 df1.pos between df2.start and df2.end"
df1 <- sqldf(query, stringsAsFactors=FALSE)

Edit:

Your pos, start, and end columns should be numeric in my opinion, because the join needs to compare them as numbers, not as text. Cast them all to numeric before running the query and the above solution should work:

df1$pos <- as.numeric(df1$pos)
df2$start <- as.numeric(df2$start)
df2$end <- as.numeric(df2$end)
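To see why the cast matters, here is a minimal base R sketch comparing the same values as text and as numbers. The variable names pos_chr and start_chr are illustrative and not from the question. Strings are compared character by character, so "1000" sorts before "500"; as far as I know, SQLite (the default sqldf backend) compares text columns in the same lexical way.

```r
# Illustrative values only: a position and an interval start, stored as text
pos_chr   <- "1000"
start_chr <- "500"

# Character comparison is lexical: "1" sorts before "5", so this is FALSE
as_text <- pos_chr >= start_chr

# After casting, the comparison is numeric and gives the intended TRUE
as_number <- as.numeric(pos_chr) >= as.numeric(start_chr)

print(c(as_text = as_text, as_number = as_number))
```

With text columns the range condition silently fails for most rows instead of raising an error, which is why the join can return NA states until the columns are cast.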

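We could use a non-equi join with data.table:

library(data.table)
# >= and <= make the interval bounds inclusive, so a pos equal to start or end still matches;
# i.state refers explicitly to the state column of df2
setDT(df1)[df2, state := i.state, on = .(chr, pos >= start, pos <= end)]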
df1
#   chr  pos  state
#1:   1 1000 state1
#2:   1 2000 state2
#3:   2 2000 state4

NOTE: when constructing a data.frame, avoid data.frame(cbind(...)), because cbind first builds a matrix and a matrix can hold only a single class, so here every column ends up as text. Pass the vectors to data.frame directly. Another problem with the example data is that 'pos', 'start', and 'end' are stored as strings. They should be numeric.
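A quick way to check the cbind behaviour described in the note is the sketch below. The names via_cbind and direct are made up for illustration.

```r
chr <- c("1", "1", "2")
pos <- c(1000, 2000, 2000)

# cbind() builds a character matrix first, so pos loses its numeric class
via_cbind <- data.frame(cbind(chr, pos))

# Passing the vectors directly keeps each column's own class
direct <- data.frame(chr, pos)

sapply(via_cbind, class)  # pos is no longer numeric
sapply(direct, class)     # pos stays numeric
```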

data

chr <- c("1","1","2")
pos <- c(1000,2000,2000)
df1 <- data.frame(chr, pos)
chr <- c("1","1","1","2","2")
start <- c(500,1500,2500,500,1500)
end <- c(1499,2499,3499,1499,2499)
state <- c("state1", "state2", "state1", "state3", "state4")
df2 <- data.frame(chr, start, end, state)
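If neither sqldf nor data.table is available, the same range lookup can be sketched in base R. This is not one of the approaches from the answers above, just an alternative under the assumption that the data is small enough to pair every position with every interval on the same chromosome. The names candidates and hits are illustrative.

```r
df1 <- data.frame(chr = c("1", "1", "2"), pos = c(1000, 2000, 2000))
df2 <- data.frame(
  chr   = c("1", "1", "1", "2", "2"),
  start = c(500, 1500, 2500, 500, 1500),
  end   = c(1499, 2499, 3499, 1499, 2499),
  state = c("state1", "state2", "state1", "state3", "state4")
)

# Pair every position with every interval on the same chromosome
candidates <- merge(df1, df2, by = "chr")

# Keep only the pairs where pos falls inside [start, end] (bounds inclusive)
inside <- candidates$pos >= candidates$start & candidates$pos <= candidates$end
hits <- candidates[inside, c("chr", "pos", "state")]

# Restore a stable row order and clean row names
hits <- hits[order(hits$chr, hits$pos), ]
rownames(hits) <- NULL
hits
```

Unlike the left join in the sqldf answer, this filter drops any df1 row whose pos falls in no interval, and the intermediate merge grows with the number of intervals per chromosome, so for large data the non-equi join is likely the better fit.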
