
R - how to manipulate data by a condition and by columns

It seems like this should be an easy task with apply, but I still can't figure it out. I have data like this:

x1= c(1,1,2,3,1,2,4) 
x2= c(1,2,2,6,2,3,1) 
x3= c(1,1,1,0,0,0,0) 
x4= c(1,0,0,0,0,3,1) 

df=data.frame( x1,x2,x3,x4) 
df
  x1 x2 x3 x4
1  1  1  1  1
2  1  2  1  0
3  2  2  1  0
4  3  6  0  0
5  1  2  0  0
6  2  3  0  3
7  4  1  0  1 

And a vector like this:

m= c(1,1,0,0)
rbind(df,m)
df=rbind(df,m)
df
  x1 x2 x3 x4
1  1  1  1  1
2  1  2  1  0
3  2  2  1  0
4  3  6  0  0
5  1  2  0  0
6  2  3  0  3
7  4  1  0  1
8  1  1  0  0

Now I'd like every value in a column that equals the value in the last row of that column (the m vector) to be changed to 0, and every other value to 1; the last row itself stays as it is. For example, df[1,2] is 1, which is the same as m[2], so the value of df2[1,2] is 0. The new data set would then look like this:

df2
  x1 x2 x3 x4
1  0  0  1  1
2  0  1  1  0
3  1  1  1  0
4  1  1  0  0
5  0  1  0  0
6  1  1  0  1
7  1  0  0  1
8  1  1  0  0

Using the 'df' dataset after the rbind, we compare all rows except the last one ( df[-8,] ) with the last row, which is replicated so that the lengths are the same ( df[8,][col(df[-8,])] ). This returns a logical matrix, which can be coerced to binary (0/1) by wrapping it in a unary + . Then we rbind the binary output with the last row of 'df' ( df[8,] ) to get the expected output.

df2 <- rbind(+(df[-8,]!=df[8,][col(df[-8,])]), df[8,])
df2
#  x1 x2 x3 x4
#1  0  0  1  1
#2  0  1  1  0
#3  1  1  1  0
#4  1  1  0  0
#5  0  1  0  0
#6  1  1  0  1
#7  1  0  0  1
#8  1  1  0  0
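
The hard-coded row index 8 in the call above can be replaced by nrow(). The following is a minimal sketch under that assumption, not the answerer's own code; the names n, last and body are introduced here only for illustration.

```r
# Rebuild the combined data from the question (rows 1-7 plus the 'm' row)
df <- data.frame(x1 = c(1, 1, 2, 3, 1, 2, 4),
                 x2 = c(1, 2, 2, 6, 2, 3, 1),
                 x3 = c(1, 1, 1, 0, 0, 0, 0),
                 x4 = c(1, 0, 0, 0, 0, 3, 1))
m  <- c(1, 1, 0, 0)
df <- rbind(df, m)

n    <- nrow(df)          # position of the last row (illustrative name)
last <- unlist(df[n, ])   # last row as a plain numeric vector

# Compare every other row with the last row, column by column,
# then turn the logical matrix into 0/1 with unary plus
body <- +(df[-n, ] != last[col(df[-n, ])])

# Put the untouched last row back underneath
df2 <- rbind(as.data.frame(body), df[n, ])
```

Because the index comes from nrow(df), the same lines should work unchanged if more rows are added above the reference row.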

Or, as @DavidArenburg mentioned, this can be made more compact by comparing 'df' as it was before the rbind step with the vector 'm'.

m1 <-  rbind(+(df != m[col(df)]), m)
row.names(m1) <- NULL

To understand it better: we replicate the 'm' vector using the col function, which returns the numeric column index of every cell of 'df'

 col(df)
 #     [,1] [,2] [,3] [,4]
 #[1,]    1    2    3    4
 #[2,]    1    2    3    4
 #[3,]    1    2    3    4
 #[4,]    1    2    3    4
 #[5,]    1    2    3    4
 #[6,]    1    2    3    4
 #[7,]    1    2    3    4

using

 m[col(df)]
 #[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0

the first element of 'm', i.e. 1, gets replicated 7 times, followed by the second element, 1, also 7 times, and so on.
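
The replication described above can also be written with rep() and its each argument. This is a small sketch to make the equivalence visible, assuming 'df' is the 7-row data frame from before the rbind step; by_index and by_rep are illustrative names.

```r
# 'df' as it was before the rbind step (7 rows, 4 columns)
df <- data.frame(x1 = c(1, 1, 2, 3, 1, 2, 4),
                 x2 = c(1, 2, 2, 6, 2, 3, 1),
                 x3 = c(1, 1, 1, 0, 0, 0, 0),
                 x4 = c(1, 0, 0, 0, 0, 3, 1))
m <- c(1, 1, 0, 0)

by_index <- m[col(df)]               # index 'm' by the column number of each cell
by_rep   <- rep(m, each = nrow(df))  # repeat each element of 'm' once per row

all(by_index == by_rep)              # both give the same 28 values
```

Either form lines 'm' up with the column-major order in which the cells of 'df' are compared.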

Now, the lengths are the same

 length( m[col(df)])
 #[1] 28
 prod(dim(df))
 #[1] 28

to have an element-by-element comparison.

 df != m[col(df)]
 #      x1    x2    x3    x4
 #[1,] FALSE FALSE  TRUE  TRUE
 #[2,] FALSE  TRUE  TRUE FALSE
 #[3,]  TRUE  TRUE  TRUE FALSE
 #[4,]  TRUE  TRUE FALSE FALSE
 #[5,] FALSE  TRUE FALSE FALSE
 #[6,]  TRUE  TRUE FALSE  TRUE
 #[7,]  TRUE FALSE FALSE  TRUE

In the last step, we coerce this to binary and rbind to 'm'.
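
The coercion to binary can be done in more than one way. The sketch below uses a small made-up logical matrix (neq, not data from the question) to compare the unary plus with the multiply-by-one idiom; all names in it are illustrative.

```r
# A small logical matrix standing in for the result of a comparison
neq <- matrix(c(TRUE, FALSE, FALSE, TRUE), nrow = 2)

via_plus  <- +neq      # unary plus: TRUE/FALSE become integer 1/0
via_times <- neq * 1   # multiplying by 1: TRUE/FALSE become double 1/0

typeof(via_plus)       # "integer"
typeof(via_times)      # "double"
dim(via_plus)          # the matrix shape is kept
```

Both keep the dimensions of the logical matrix, so either result can be passed straight to rbind together with 'm'.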


Another option is to use sweep with MARGIN=2, so that 'm' is applied across the columns:

rbind(+(sweep(df, 2 ,m ,'!=')), m)
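
The same idea can be written on a plain matrix, which is the kind of input sweep() is documented for. This is a self-contained sketch that assumes converting 'df' with as.matrix() first is acceptable; neq and m1 are illustrative names.

```r
# 'df' as it was before the rbind step
df <- data.frame(x1 = c(1, 1, 2, 3, 1, 2, 4),
                 x2 = c(1, 2, 2, 6, 2, 3, 1),
                 x3 = c(1, 1, 1, 0, 0, 0, 0),
                 x4 = c(1, 0, 0, 0, 0, 3, 1))
m <- c(1, 1, 0, 0)

# MARGIN = 2 pairs m[j] with column j; FUN = "!=" marks the cells that differ
neq <- sweep(as.matrix(df), 2, m, FUN = "!=")

# Coerce to 0/1 and append 'm' as the last row
m1 <- rbind(+neq, m)
rownames(m1) <- NULL
```

The result is a numeric matrix; wrap it in as.data.frame() if a data frame is needed.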

You could try the following:

# 'df' here is the data after the rbind step (8 rows).
# t(df) puts each row of df into a column, so m is recycled against one row
# at a time; the logical matrix is then transposed back to the original layout
# and TRUE/FALSE are coerced to 1/0 by multiplying with 1
df2 <- t(t(df) != m) * 1
df2[nrow(df2),] <- m # keep the last row unchanged
#> df2
#  x1 x2 x3 x4
#1  0  0  1  1
#2  0  1  1  0
#3  1  1  1  0
#4  1  1  0  0
#5  0  1  0  0
#6  1  1  0  1
#7  1  0  0  1
#8  1  1  0  0
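
The double transpose matters because R recycles a vector down the columns of a matrix. The sketch below contrasts the direct comparison with the transposed one, assuming the 7-row data from before the rbind step; mat, wrong and right are illustrative names.

```r
# 'df' as it was before the rbind step, as a numeric matrix
df <- data.frame(x1 = c(1, 1, 2, 3, 1, 2, 4),
                 x2 = c(1, 2, 2, 6, 2, 3, 1),
                 x3 = c(1, 1, 1, 0, 0, 0, 0),
                 x4 = c(1, 0, 0, 0, 0, 3, 1))
m   <- c(1, 1, 0, 0)
mat <- as.matrix(df)

# Direct comparison: m is recycled down each 7-element column,
# so m[j] is NOT lined up with column j
wrong <- mat != m

# Transposed comparison: every column of t(mat) is one row of df (length 4),
# so m lines up with it exactly; transpose again to restore the layout
right <- t(t(mat) != m)

wrong[1, 2]   # TRUE: x2[1] was compared with m[4] instead of m[2]
right[1, 2]   # FALSE: x2[1] equals m[2]
```

No warning is raised for the direct comparison here because 28 is a multiple of length(m), which makes the misalignment easy to miss.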


 