
Pairwise comparison of dataframe column entries in R

Data:

df_dat = structure(list(code = c(1L, 10000L, 10001L), yr_1986 = c(NA, 10000L, 10001L), yr_1987 = c(NA, 10000L, 10001L), yr_1988 = c(NA, 10000L, 10001L), yr_1989 = c(NA, 10000L, NA), yr_1990 = c(NA, 10000L, NA), yr_1991 = c(1L, 10000L, 10001L), yr_1992 = c(NA, 10000L, 10001L), yr_1993 = c(NA, 10000L, 10001L), yr_1994 = c(NA, 10000L, NA), yr_1995 = c(NA, 10000L, NA), yr_1996 = c(NA, 10000L, NA), yr_1997 = c(NA, 10000L, 10001L), yr_1998 = c(NA, 10000L, 10001L), yr_1999 = c(NA, 10000L, 10001L), yr_2000 = c(NA, 10000L, 10001L), yr_2001 = c(NA, 10000L, NA), yr_2002 = c(NA, 10000L, NA), yr_2003 = c(NA, 10000L, NA), yr_2004 = c(NA, 10000L, NA), yr_2005 = c(NA, 10000L, NA), yr_2006 = c(NA, 10000L, NA), yr_2007 = c(NA, 10000L, NA), yr_2008 = c(NA, 10000L, NA), yr_2009 = c(NA, 10000L, 10001L), yr_2010 = c(NA, 10000L, 10001L), yr_2011 = c(NA, 10000L, 10001L), yr_2012 = c(NA, 10000L, 10001L), yr_2013 = c(NA, 10000L, 10001L), yr_2014 = c(NA, 10000L, NA), yr_2015 = c(NA, 10000L, NA), yr_2016 = c(NA, 10000L, NA), yr_2017 = c(NA, 10000L, NA), yr_2018 = c(NA, 10000L, NA)), .Names = c("code", "yr_1986", "yr_1987", "yr_1988", "yr_1989", "yr_1990", "yr_1991", "yr_1992", "yr_1993", "yr_1994", "yr_1995", "yr_1996", "yr_1997", "yr_1998", "yr_1999", "yr_2000", "yr_2001", "yr_2002", "yr_2003", "yr_2004", "yr_2005", "yr_2006", "yr_2007", "yr_2008", "yr_2009", "yr_2010", "yr_2011", "yr_2012", "yr_2013", "yr_2014", "yr_2015", "yr_2016", "yr_2017", "yr_2018"), class = "data.frame", row.names = c(NA, -3L))

Question: I am trying to perform a conditional pairwise comparison across the columns of my dataframe to check the recurrence of the values stored in the first column, code, which are numeric codes. The remaining columns form a time series from 1986 to 2018: each year column records whether the code in the code column occurs in that year.

Now, to the crux of the problem. The objective is to create a new dataframe whose entries are populated through conditional statements based on the occurrence and disappearance over time of the values stored in the code column. The expected result is as follows:

Result:

df_out = structure(list(code = c(1L, 10000L), yr_1986 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_1987 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_1988 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_1989 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_1990 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_1991 = structure(c(2L, 1L), .Label = c("EXIST", "NEW"), class = "factor"), yr_1992 = structure(1:2, .Label = c("CLOSED", "EXIST"), class = "factor"), yr_1993 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_1994 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_1995 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_1996 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_1997 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_1998 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_1999 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2000 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2001 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2002 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2003 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2004 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2005 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2006 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2007 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2008 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2009 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2010 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2011 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2012 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2013 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2014 = structure(c(NA, 1L), 
.Label = "EXIST", class = "factor"), yr_2015 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2016 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2017 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2018 = structure(c(NA, 1L), .Label = "EXIST", class = "factor")), .Names = c("code", "yr_1986", "yr_1987", "yr_1988", "yr_1989", "yr_1990", "yr_1991", "yr_1992", "yr_1993", "yr_1994", "yr_1995", "yr_1996", "yr_1997", "yr_1998", "yr_1999", "yr_2000", "yr_2001", "yr_2002", "yr_2003", "yr_2004", "yr_2005", "yr_2006", "yr_2007", "yr_2008", "yr_2009", "yr_2010", "yr_2011", "yr_2012", "yr_2013", "yr_2014", "yr_2015", "yr_2016", "yr_2017", "yr_2018"), class = "data.frame", row.names = c(NA, -2L))

What follows is a brief description of the mechanics of what I intend to achieve. The first column, code, stores the codes of interest, one code per row. The remaining columns are year columns whose entries show the occurrence of that row's code over time.

Now, the aim is to check the occurrence of each code in the code column over time (i.e. across the year columns) and recode the entries in the output as:

  • NEW for the first year(t) of occurrence;
  • CLOSED if the code stops recurring at year(t+1) after having occurred in year(t);
  • EXIST if the code keeps recurring for all years.
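
As a rough illustration of these three rules, here is a minimal base R sketch for the yearly values of a single code. The helper name flag_years is hypothetical, and the sketch assumes that a code present in every year is labelled EXIST throughout, while any other code gets NEW whenever it appears after an absent year (or in the first year) and CLOSED in the first absent year after a present one.

```r
# Hypothetical helper: label one row of yearly values (NA = code absent).
flag_years <- function(v) {
  present <- !is.na(v)
  # Assumption: a code seen in every year is EXIST throughout.
  if (all(present)) return(rep("EXIST", length(v)))
  # Presence in the previous year; the first year has no predecessor.
  prev <- c(FALSE, present[-length(present)])
  out <- rep(NA_character_, length(v))
  out[present & !prev] <- "NEW"      # present now, absent the year before
  out[!present & prev] <- "CLOSED"   # absent now, present the year before
  out
}

flag_years(c(NA, NA, 1L, NA, NA))  # NA NA "NEW" "CLOSED" NA
flag_years(c(5L, 5L, 5L))          # "EXIST" "EXIST" "EXIST"
```

Applied to the sample data, this reading reproduces the expected NEW in yr_1991 and CLOSED in yr_1992 for code 1.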

I hope I have managed to describe the problem as clearly as possible.

EDIT: I have found a suboptimal way to solve the problem by splitting the data into two types: 1) type 1 collects the codes that show up only in some years; 2) type 2 collects the codes that recur in every year of the period. The code and output below are based on the sample data I provided. But again, this is not optimal.

#Load packages
require(tidyverse)

#Select only the year columns in the input data
df_dat_year = df_dat %>%
select(-code)

#Select only the code column for later use
df_dat_code = df_dat %>%
select(code)

#Dataframe including all observations for code=1
df_dat1 = df_dat_year[1:1,]

#Dataframe including all observations for code=10000
df_dat2 = df_dat_year[2:2,]

#Create output dataframes
df_out1 = as.data.frame(matrix(nrow = nrow(df_dat1), ncol = ncol(df_dat1)))
df_out2 = as.data.frame(matrix(nrow = nrow(df_dat2), ncol = ncol(df_dat2)))

#Loop code for each output dataframe

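##For output 1
#Compare each year with the next column directly; lead() on a single value always returns NA
for(i in seq_len(nrow(df_dat1))) {
  for(j in seq_len(ncol(df_dat1) - 1)) {
    if(!is.na(df_dat1[i,j]) && is.na(df_dat1[i,j+1])) {
      df_out1[i,j] = "new"
      df_out1[i,j+1] = "closed"
    }
  }
}
print(df_out1)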

##For output 2
for(i in seq_len(nrow(df_dat2))) {
  for(j in seq_len(ncol(df_dat2))) {
    if(!is.na(df_dat2[i,j])) {
      df_out2[i,j] = "exists"
    }
  }
}
print(df_out2)

Once I have filled out the entries in the output, I just join the dataframes with rbind() . Subsequently, I add the code column with a cbind() . Final output looks as follows:

#Row-binding the output dataframes
df_out = rbind(df_out1,df_out2)

#Adding the code column (first two codes only, matching the two output rows) to the final output dataframe
df_out_fin = cbind(code = df_dat_code$code[1:2], df_out)

But again, this is a much messier and more convoluted way of solving the problem. Does anyone have a better solution that does not require the multitude of steps I have added?

Here's a tidyverse approach:

library(tidyverse)
df_dat %>% 
  pivot_longer(-code) %>%
  group_by(code) %>%
  mutate(value = case_when(
    sum(!is.na(value)) == n() ~ "exists",
    !is.na(value) & is.na(lag(value)) ~ "new",
    is.na(value) & !is.na(lag(value)) ~ "closed",
    TRUE ~ NA_character_
  )) %>%
  ungroup() %>%
  pivot_wider(names_from = name, values_from = value)

Result

# A tibble: 2 x 34
   code yr_1986 yr_1987 yr_1988 yr_1989 yr_1990 yr_1991 yr_1992 yr_1993 yr_1994 yr_1995 yr_1996 yr_1997 yr_1998 yr_1999 yr_2000 yr_2001 yr_2002 yr_2003 yr_2004 yr_2005 yr_2006 yr_2007 yr_2008 yr_2009 yr_2010 yr_2011 yr_2012 yr_2013 yr_2014 yr_2015 yr_2016 yr_2017 yr_2018
  <int> <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
1     1 NA      NA      NA      NA      NA      new     closed  NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA     
2 10000 exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists 
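
The "new" and "closed" conditions above hinge on comparing each year with the previous one. Below is a minimal base R sketch of that comparison on a made-up vector. It assumes that dplyr's lag() with default arguments behaves like a one-step shift padded with NA; the names value and prev are illustrative only.

```r
# Made-up yearly values for one code (NA = code absent that year).
value <- c(NA, 7L, 7L, NA)
# Base R stand-in for dplyr::lag(value): shift right by one, pad with NA.
prev <- c(NA, value[-length(value)])
is_new <- !is.na(value) & is.na(prev)      # present now, absent before
is_closed <- is.na(value) & !is.na(prev)   # absent now, present before
which(is_new)     # 2
which(is_closed)  # 4
```

Because the padded first element is NA, a code that is present in the very first year is also flagged as new unless the "exists" condition catches it first.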

If the work is done row by row, pmap is another option:

library(dplyr)
library(purrr)
pmap_dfr(df_dat[-1], ~ {
    tibble(v1 = c(...), v2 = lag(v1)) %>%
    transmute(out = case_when(all(!is.na(v1))  ~ 'EXISTS',
            !is.na(v1) & is.na(v2) ~ "NEW", 
            is.na(v1) & !is.na(v2) ~ "CLOSED")) %>%
    pull(out) %>% 
    set_names(names(df_dat)[-1])  }) %>%
    bind_cols(df_dat[1],.)
#   code yr_1986 yr_1987 yr_1988 yr_1989 yr_1990 yr_1991 yr_1992 yr_1993 yr_1994 yr_1995 yr_1996 yr_1997 yr_1998 yr_1999
#1     1    <NA>    <NA>    <NA>    <NA>    <NA>     NEW  CLOSED    <NA>    <NA>    <NA>    <NA>    <NA>    <NA>    <NA>
#2 10000  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS
#  yr_2000 yr_2001 yr_2002 yr_2003 yr_2004 yr_2005 yr_2006 yr_2007 yr_2008 yr_2009 yr_2010 yr_2011 yr_2012 yr_2013 yr_2014
#1    <NA>    <NA>    <NA>    <NA>    <NA>    <NA>    <NA>    <NA>    <NA>    <NA>    <NA>    <NA>    <NA>    <NA>    <NA>
#2  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS
#  yr_2015 yr_2016 yr_2017 yr_2018
#1    <NA>    <NA>    <NA>    <NA>
#2  EXISTS  EXISTS  EXISTS  EXISTS

A simple solution would be to create vectors storing the preceding and following values of your time series:

x <- !is.na(df_dat[1,-1])
x.prec <- c(NA, x[-length(x)])
x.foll <- c(x[-1], NA)

Then you can find, e.g., all NEW flags (values set, but with an unset predecessor) with

x.new <- x & !x.prec

CLOSED (last set value) can be found in a similar way, and so on.
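
A small sketch of how the remaining flag might be derived with the same shifted-vector idea, on made-up values. Two assumptions are made here: the predecessor vector is padded with FALSE instead of NA, so that a value set in the first year can still count as NEW, and CLOSED follows the question's definition (the first unset year after a set one) rather than the last set year.

```r
# Made-up yearly values for one code (NA = code absent that year).
vals <- c(NA, 3L, 3L, NA, 3L)
x <- !is.na(vals)
# Assumption: pad with FALSE so the first year has an "unset" predecessor.
x.prec <- c(FALSE, x[-length(x)])
x.new <- x & !x.prec       # set now, unset the year before
x.closed <- !x & x.prec    # unset now, set the year before
which(x.new)     # 2 5
which(x.closed)  # 4
```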

Here is a base R approach:

mat = is.na(df_dat[, -1L])
res = matrix(NA_character_, ncol = ncol(mat), nrow = nrow(mat))

#code = 1 logic:
x = mat[1L, ]
ind_new = which(!x & c(x[-1L], FALSE))
ind_closed = ind_new + 1L
res[1L, c(ind_new, ind_closed)] = rep(c("new", 'closed'), each = length(ind_new))

#code = 10000 logic:
x = mat[2L, ]
res[2L, !x] = "exists"

res
cbind(df_dat[1L], res)

Basically, we are using is.na(df_dat[, -1L]) to evaluate your logic.
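
The same is.na() matrix idea could be applied to all rows at once instead of writing separate logic per code. The following is only a sketch on a made-up matrix yrs, and it is a variation rather than the code above: it compares each year with the previous one (following the rules in the question) and then overwrites rows without any gap as "exists".

```r
# Made-up stand-in for the year columns (NA = code absent that year).
yrs <- rbind(c(NA, 1L, NA, NA),
             c(2L, 2L, 2L, 2L),
             c(3L, NA, NA, 3L))
present <- !is.na(yrs)
# Presence in the previous year; the first year has no predecessor.
prev <- cbind(FALSE, present[, -ncol(present), drop = FALSE])
res <- matrix(NA_character_, nrow = nrow(yrs), ncol = ncol(yrs))
res[present & !prev] <- "new"      # present now, absent the year before
res[!present & prev] <- "closed"   # absent now, present the year before
# Rows with no gaps at all are overwritten as "exists".
res[rowSums(present) == ncol(present), ] <- "exists"
res
```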
