I have a data frame containing 3 columns of non-integer values. Values in one column are often identical to values in one or both of the other columns of the same data frame. Where values match across columns, I would like them to be on the same row.
See subset_df vs expected_subset_df below for clarification.
Notice that the values ending in "348:-" are on the same row in expected_subset_df but not in subset_df.
Summary: values in col1 can also be in col2 and/or col3. If the values between columns do match I want them on the same row.
> subset_df
col1 col2 col3
1 20:31722330:- 20:31722330:- 20:31722330:-
2 20:31722348:- 20:31724051:- 20:31724051:-
3 FALSE 20:31722348:- 20:31722348:-
> expected_subset_df
col1 col2 col3
1 20:31722330:- 20:31722330:- 20:31722330:-
2 20:31722348:- 20:31722348:- 20:31722348:-
3 FALSE 20:31724051:- 20:31724051:-
library(dplyr)
# Reorder col1 and col3 so that they line up with the values in col2
subset_df %>%
  mutate_all(as.character) %>%
  mutate(col1 = subset_df$col1[match(subset_df$col2, subset_df$col1)],
         col3 = subset_df$col3[match(subset_df$col2, subset_df$col3)])
Yields:
col1 col2 col3
1 20:31722330:- 20:31722330:- 20:31722330:-
2 <NA> 20:31724051:- 20:31724051:-
3 20:31722348:- 20:31722348:- 20:31722348:-
Is this method robust? Is there a better alternative?
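One way to probe the robustness question is to check what happens to a value that occurs in col1 but not in col2. The sketch below uses a small hypothetical two-column data frame (not the real data) and the same `match()` idea, with col2 as the reference column. It suggests that any value absent from col2 is silently dropped.

```r
# Hypothetical data: "20:31722330:-" occurs only in col1.
df <- data.frame(
  col1 = c("20:31722330:-", "21:15014555:-"),
  col2 = c("20:31722344:-", "21:15014555:-"),
  stringsAsFactors = FALSE
)

# Same idea as above: reorder col1 so that it lines up with col2.
aligned_col1 <- df$col1[match(df$col2, df$col1)]
print(aligned_col1)

# Values of the original col1 that no longer appear after alignment.
lost <- setdiff(df$col1, aligned_col1)
print(lost)
```

Because col2 acts as the reference, the approach only seems safe when every value of interest is present in col2.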
Suppose dataframe breakpoint looks like this:
> breakpoint
col1 col2 col3
1 20:31722330:- 20:31722344:- FALSE
2 21:15014555:- 21:15014555:- FALSE
3 21:15014767:- 21:15014767:- 21:15014767:-
How can I turn dataframe breakpoint into this:
> expected_breakpoint
col1 col2 col3
1 20:31722330:- <NA> <NA>
2 <NA> 20:31722344:- <NA>
3 21:15014555:- 21:15014555:- <NA>
4 <NA> <NA> FALSE
5 <NA> <NA> FALSE
6 21:15014767:- 21:15014767:- 21:15014767:-
Converting FALSE into <NA> before analysis:
Suppose dataframe breakpoint_new looks like this:
> breakpoint_new
col1 col2 col3
1 20:31722330:- 20:31722344:- <NA>
2 21:15014555:- 21:15014555:- <NA>
3 21:15014767:- 21:15014767:- 21:15014767:-
How can I turn dataframe breakpoint_new into this:
> expected_breakpoint_new
col1 col2 col3
1 20:31722330:- <NA> <NA>
2 <NA> 20:31722344:- <NA>
3 21:15014555:- 21:15014555:- <NA>
4 21:15014767:- 21:15014767:- 21:15014767:-
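A possible base R sketch for this transformation: collect every distinct non-NA value across the columns, then give each value its own row, filled in only for the columns that contain it. The helper name `align_by_value` is hypothetical. The sketch assumes that sorted row order is acceptable and that duplicates within a column can be collapsed.

```r
breakpoint_new <- data.frame(
  col1 = c("20:31722330:-", "21:15014555:-", "21:15014767:-"),
  col2 = c("20:31722344:-", "21:15014555:-", "21:15014767:-"),
  col3 = c(NA, NA, "21:15014767:-"),
  stringsAsFactors = FALSE
)

# align_by_value is a hypothetical helper name, not an existing function.
align_by_value <- function(df) {
  # One output row per distinct value; sort() drops NA by default.
  keys <- sort(unique(unlist(lapply(df, as.character), use.names = FALSE)))
  # Keep the key where the column contains it, NA otherwise.
  out <- lapply(df, function(column) {
    ifelse(keys %in% column, keys, NA_character_)
  })
  as.data.frame(out, stringsAsFactors = FALSE)
}

aligned <- align_by_value(breakpoint_new)
print(aligned)
```

This version works for any number of columns, since it never refers to a column by name.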
The following function solves my problem:
match_columns = function(df, nomatch = FALSE){
  if (ncol(df) != 3){
    stop("Input DataFrame needs to have 3 columns")
  }
  # Access columns by position so the function does not depend on column names
  c1 = df[[1]]
  c2 = df[[2]]
  c3 = df[[3]]
  # Placeholders (NA or the nomatch value) never get a row of their own
  skip = function(item) is.na(item) || isTRUE(item == nomatch)
  result = matrix(ncol = 3, nrow = 0)
  match12 = intersect(c1, c2)
  match23 = intersect(c2, c3)
  match13 = intersect(c1, c3)
  # Values shared by columns 1 and 2 (and possibly 3)
  for (item in match12){
    if (skip(item)){next}
    if (item %in% match23){
      result = rbind(result, rep(item, 3))
    }else{
      result = rbind(result, c(rep(item, 2), nomatch))
    }
  }
  # Values shared by columns 1 and 3 only
  for (item in match13){
    if (skip(item)){next}
    if (!(item %in% match12)){
      result = rbind(result, c(item, nomatch, item))
    }
  }
  # Values shared by columns 2 and 3 only
  for (item in match23){
    if (skip(item)){next}
    if (!(item %in% match13)){
      result = rbind(result, c(nomatch, rep(item, 2)))
    }
  }
  # Values found only in column 1
  for (item in c1){
    if (skip(item)){next}
    if (!(item %in% match12) && !(item %in% match13)){
      result = rbind(result, c(item, rep(nomatch, 2)))
    }
  }
  # Values found only in column 2
  for (item in c2){
    if (skip(item)){next}
    if (!(item %in% match12) && !(item %in% match23)){
      result = rbind(result, c(nomatch, item, nomatch))
    }
  }
  # Values found only in column 3
  for (item in c3){
    if (skip(item)){next}
    if (!(item %in% match13) && !(item %in% match23)){
      result = rbind(result, c(rep(nomatch, 2), item))
    }
  }
  return(result)
}
Values in their respective columns are matched with identical values in other columns. FALSE's are introduced if not all three columns match.