Pairwise comparison of dataframe column entries in R
Data:
df_dat = structure(list(code = c(1L, 10000L, 10001L), yr_1986 = c(NA, 10000L, 10001L), yr_1987 = c(NA, 10000L, 10001L), yr_1988 = c(NA, 10000L, 10001L), yr_1989 = c(NA, 10000L, NA), yr_1990 = c(NA, 10000L, NA), yr_1991 = c(1L, 10000L, 10001L), yr_1992 = c(NA, 10000L, 10001L), yr_1993 = c(NA, 10000L, 10001L), yr_1994 = c(NA, 10000L, NA), yr_1995 = c(NA, 10000L, NA), yr_1996 = c(NA, 10000L, NA), yr_1997 = c(NA, 10000L, 10001L), yr_1998 = c(NA, 10000L, 10001L), yr_1999 = c(NA, 10000L, 10001L), yr_2000 = c(NA, 10000L, 10001L), yr_2001 = c(NA, 10000L, NA), yr_2002 = c(NA, 10000L, NA), yr_2003 = c(NA, 10000L, NA), yr_2004 = c(NA, 10000L, NA), yr_2005 = c(NA, 10000L, NA), yr_2006 = c(NA, 10000L, NA), yr_2007 = c(NA, 10000L, NA), yr_2008 = c(NA, 10000L, NA), yr_2009 = c(NA, 10000L, 10001L), yr_2010 = c(NA, 10000L, 10001L), yr_2011 = c(NA, 10000L, 10001L), yr_2012 = c(NA, 10000L, 10001L), yr_2013 = c(NA, 10000L, 10001L), yr_2014 = c(NA, 10000L, NA), yr_2015 = c(NA, 10000L, NA), yr_2016 = c(NA, 10000L, NA), yr_2017 = c(NA, 10000L, NA), yr_2018 = c(NA, 10000L, NA)), .Names = c("code", "yr_1986", "yr_1987", "yr_1988", "yr_1989", "yr_1990", "yr_1991", "yr_1992", "yr_1993", "yr_1994", "yr_1995", "yr_1996", "yr_1997", "yr_1998", "yr_1999", "yr_2000", "yr_2001", "yr_2002", "yr_2003", "yr_2004", "yr_2005", "yr_2006", "yr_2007", "yr_2008", "yr_2009", "yr_2010", "yr_2011", "yr_2012", "yr_2013", "yr_2014", "yr_2015", "yr_2016", "yr_2017", "yr_2018"), class = "data.frame", row.names = c(NA, -3L))
Question: I am trying to perform a conditional pairwise comparison across the columns of my dataframe to check the recurrence of the values stored in the first column, code, which are numeric codes. The remaining columns are a time series from 1986 to 2018: each year column shows whether the code stored in the code column occurs in that year.
Now, to the crux of the problem. The objective is to create a new dataframe whose entries are populated through conditional statements based on the appearance and disappearance of the values stored in the code column over time. The expected result is as follows:
Result:
df_out = structure(list(code = c(1L, 10000L), yr_1986 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_1987 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_1988 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_1989 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_1990 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_1991 = structure(c(2L, 1L), .Label = c("EXIST", "NEW"), class = "factor"), yr_1992 = structure(1:2, .Label = c("CLOSED", "EXIST"), class = "factor"), yr_1993 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_1994 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_1995 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_1996 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_1997 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_1998 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_1999 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2000 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2001 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2002 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2003 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2004 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2005 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2006 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2007 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2008 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2009 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2010 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2011 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2012 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2013 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2014 = structure(c(NA, 1L), 
.Label = "EXIST", class = "factor"), yr_2015 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2016 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2017 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2018 = structure(c(NA, 1L), .Label = "EXIST", class = "factor")), .Names = c("code", "yr_1986", "yr_1987", "yr_1988", "yr_1989", "yr_1990", "yr_1991", "yr_1992", "yr_1993", "yr_1994", "yr_1995", "yr_1996", "yr_1997", "yr_1998", "yr_1999", "yr_2000", "yr_2001", "yr_2002", "yr_2003", "yr_2004", "yr_2005", "yr_2006", "yr_2007", "yr_2008", "yr_2009", "yr_2010", "yr_2011", "yr_2012", "yr_2013", "yr_2014", "yr_2015", "yr_2016", "yr_2017", "yr_2018"), class = "data.frame", row.names = c(NA, -2L))
What follows is a brief description of the mechanics of what I intend to achieve. The first column, code, stores the codes of interest, one code per row. The remaining columns are year columns whose entries show the occurrence of that row's code over time.
The aim is to check the occurrence of each code in the code column over time (i.e. across the year columns) and recode the entries in the output with the labels NEW, CLOSED and EXIST, as shown in the expected result above.
I hope I have managed to describe the problem as clearly as possible.
EDIT: I have managed to find a suboptimal way to solve the problem. I split the data into two types: 1) type 1 collects the codes that show up only in some years; 2) type 2 collects the codes that recur in every year of the period. What follows is the code and output based on the sample data I provided. But again, this is not optimal.
#Load packages
require(tidyverse)
#Select only the year columns in the input data
df_dat_year = df_dat %>%
  select(-code)
#Select only the code column for later use
df_dat_code = df_dat %>%
  select(code)
#Dataframe including all observations for code=1
df_dat1 = df_dat_year[1:1,]
#Dataframe including all observations for code=10000
df_dat2 = df_dat_year[2:2,]
#Create output dataframes
df_out1 = as.data.frame(matrix(nrow = nrow(df_dat1), ncol = ncol(df_dat1)))
df_out2 = as.data.frame(matrix(nrow = nrow(df_dat2), ncol = ncol(df_dat2)))
#Loop code for each output dataframe
##For output 1
for(i in 1:nrow(df_dat1)) {
  #Stop one column early so that j+1 stays within the year columns
  for(j in 1:(ncol(df_dat1) - 1)) {
    #lead() on a single value is always NA, so compare with the next column directly
    if(!is.na(df_dat1[i,j]) && is.na(df_dat1[i,j+1])) {
      df_out1[i,j] = "new"
      df_out1[i,j+1] = "closed"
    }
  }
}
print(df_out1)
##For output 2
for(i in 1:nrow(df_dat2)) {
  for(j in 1:ncol(df_dat2)) {
    if(!is.na(df_dat2[i,j])) {
      df_out2[i,j] = "exists"
    }
  }
}
print(df_out2)
Once I have filled in the entries in the output, I join the dataframes with rbind(). Subsequently, I add the code column with cbind(). The final output is built as follows:
#Row-binding the output dataframes
df_out = rbind(df_out1,df_out2)
#Adding the code column (rows for code 1 and code 10000) to the final output dataframe
df_out_fin = cbind(df_dat_code[1:2, , drop = FALSE], df_out)
But again, this is a much messier and more convoluted way of solving the problem. Does anyone have a better solution that does not require the multitude of steps I have added?
Here's a tidyverse approach:
library(tidyverse)
df_dat %>%
  pivot_longer(-code) %>%
  group_by(code) %>%
  mutate(value = case_when(
    sum(!is.na(value)) == n() ~ "exists",
    !is.na(value) & is.na(lag(value)) ~ "new",
    is.na(value) & !is.na(lag(value)) ~ "closed",
    TRUE ~ NA_character_
  )) %>%
  ungroup() %>%
  pivot_wider(names_from = name, values_from = value)
Result
# A tibble: 2 x 34
code yr_1986 yr_1987 yr_1988 yr_1989 yr_1990 yr_1991 yr_1992 yr_1993 yr_1994 yr_1995 yr_1996 yr_1997 yr_1998 yr_1999 yr_2000 yr_2001 yr_2002 yr_2003 yr_2004 yr_2005 yr_2006 yr_2007 yr_2008 yr_2009 yr_2010 yr_2011 yr_2012 yr_2013 yr_2014 yr_2015 yr_2016 yr_2017 yr_2018
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 NA NA NA NA NA new closed NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2 10000 exists exists exists exists exists exists exists exists exists exists exists exists exists exists exists exists exists exists exists exists exists exists exists exists exists exists exists exists exists exists exists exists exists
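To see how the lag() comparisons inside the case_when() above produce the flags, here is a minimal sketch on a single vector. The vector value is made up for illustration and is not taken from the sample data.

```r
library(dplyr)

# Hypothetical series for one code: present only in the second year
value <- c(NA, 1L, NA, NA)

# lag() shifts the series by one position, so each entry is paired
# with the entry of the preceding year (NA before the first year)
previous <- lag(value)

flags <- case_when(
  !is.na(value) & is.na(previous) ~ "new",     # present now, absent before
  is.na(value) & !is.na(previous) ~ "closed",  # absent now, present before
  TRUE ~ NA_character_
)
flags
```

Because lag() returns NA for the first position, a code that is already present in the first year is also flagged "new" by these conditions unless the "exists" condition catches it first.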
If the comparison is done by row, pmap is another option:
library(dplyr)
library(purrr)
pmap_dfr(df_dat[-1], ~ {
  tibble(v1 = c(...), v2 = lag(v1)) %>%
    transmute(out = case_when(all(!is.na(v1)) ~ 'EXISTS',
                              !is.na(v1) & is.na(v2) ~ "NEW",
                              is.na(v1) & !is.na(v2) ~ "CLOSED")) %>%
    pull(out) %>%
    set_names(names(df_dat)[-1]) }) %>%
  bind_cols(df_dat[1], .)
# code yr_1986 yr_1987 yr_1988 yr_1989 yr_1990 yr_1991 yr_1992 yr_1993 yr_1994 yr_1995 yr_1996 yr_1997 yr_1998 yr_1999
#1 1 <NA> <NA> <NA> <NA> <NA> NEW CLOSED <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#2 10000 EXISTS EXISTS EXISTS EXISTS EXISTS EXISTS EXISTS EXISTS EXISTS EXISTS EXISTS EXISTS EXISTS EXISTS
# yr_2000 yr_2001 yr_2002 yr_2003 yr_2004 yr_2005 yr_2006 yr_2007 yr_2008 yr_2009 yr_2010 yr_2011 yr_2012 yr_2013 yr_2014
#1 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#2 EXISTS EXISTS EXISTS EXISTS EXISTS EXISTS EXISTS EXISTS EXISTS EXISTS EXISTS EXISTS EXISTS EXISTS EXISTS
# yr_2015 yr_2016 yr_2017 yr_2018
#1 <NA> <NA> <NA> <NA>
#2 EXISTS EXISTS EXISTS EXISTS
A simple solution would be to create vectors storing the preceding and following values of your time series:
x <- !is.na(df_dat[1,-1])
x.prec <- c(NA, x[-length(x)])
x.foll <- c(x[-1], NA)
Then you can find, e.g., all NEW flags (value set, but predecessor unset) with
x.new <- x & !x.prec
CLOSED (last set value) and so on can be found in a similar way.
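A minimal sketch of how the same shifted vectors could give the remaining flags. The presence vector x is made up for illustration. Two readings of CLOSED are shown: the last set value, as described in this answer, and the first unset year after a set one, which is an assumption based on where CLOSED appears in the question's expected result.

```r
# Hypothetical presence vector for one code (TRUE = code occurs in that year)
x <- c(FALSE, FALSE, TRUE, FALSE, FALSE)

# Shifted copies: value in the preceding and in the following year
x.prec <- c(NA, x[-length(x)])
x.foll <- c(x[-1], NA)

# NEW: set now, unset in the preceding year
x.new <- x & !x.prec
# Last set value: set now, unset in the following year
x.last <- x & !x.foll
# CLOSED as in the expected result (assumed): unset now, set in the preceding year
x.closed <- !x & x.prec

# which() drops the NA produced at the edges of the shifted vectors
flag <- rep(NA_character_, length(x))
flag[which(x.new)] <- "NEW"
flag[which(x.closed)] <- "CLOSED"
flag
```

For a code that occurs in a single year, x.new and x.last mark the same position, so the one-year shift in x.closed keeps the two labels from colliding.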
Here is a base R approach:
mat = is.na(df_dat[, -1L])
res = matrix(NA_character_, ncol = ncol(mat), nrow = nrow(mat))
#code = 1 logic:
x = mat[1L, ]
ind_new = which(!x & c(x[-1L], FALSE))
ind_closed = ind_new + 1L
res[1L, c(ind_new, ind_closed)] = rep(c("new", 'closed'), each = length(ind_new))
#code = 10000 logic:
x = mat[2L, ]
res[2L, !x] = "exists"
res
cbind(df_dat[1L], res)
Basically, we are using is.na(df_dat[, -1L]) to evaluate your logic.
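The per-row logic above could also be applied to all rows at once. The following is only a sketch under stated assumptions: the toy data frame is hypothetical, a code present in the first year counts as "new" unless it is present in every year, and "closed" is placed in the first missing year after an occurrence.

```r
# Toy input in the same shape as df_dat (hypothetical values)
toy <- data.frame(code = c(1L, 2L),
                  yr_1 = c(NA, 2L),
                  yr_2 = c(1L, 2L),
                  yr_3 = c(NA, 2L))

# TRUE where the code occurs in that year
present <- !is.na(as.matrix(toy[, -1L]))
# Presence in the previous year; FALSE before the first year (assumption)
prev <- cbind(FALSE, present[, -ncol(present), drop = FALSE])

res <- matrix(NA_character_, nrow(present), ncol(present),
              dimnames = list(NULL, colnames(present)))
res[present & !prev] <- "new"       # appears after being absent
res[!present & prev] <- "closed"    # absent right after being present
# Codes present in every year are labelled "exists" throughout
res[rowSums(present) == ncol(present), ] <- "exists"

cbind(toy[1L], res)
```

Shifting the whole logical matrix by one column replaces the separate handling of each code, so the same three assignments cover any number of rows.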