
Pairwise comparison of dataframe column entries in R

Data:

df_dat = structure(list(code = c(1L, 10000L, 10001L), yr_1986 = c(NA, 10000L, 10001L), yr_1987 = c(NA, 10000L, 10001L), yr_1988 = c(NA, 10000L, 10001L), yr_1989 = c(NA, 10000L, NA), yr_1990 = c(NA, 10000L, NA), yr_1991 = c(1L, 10000L, 10001L), yr_1992 = c(NA, 10000L, 10001L), yr_1993 = c(NA, 10000L, 10001L), yr_1994 = c(NA, 10000L, NA), yr_1995 = c(NA, 10000L, NA), yr_1996 = c(NA, 10000L, NA), yr_1997 = c(NA, 10000L, 10001L), yr_1998 = c(NA, 10000L, 10001L), yr_1999 = c(NA, 10000L, 10001L), yr_2000 = c(NA, 10000L, 10001L), yr_2001 = c(NA, 10000L, NA), yr_2002 = c(NA, 10000L, NA), yr_2003 = c(NA, 10000L, NA), yr_2004 = c(NA, 10000L, NA), yr_2005 = c(NA, 10000L, NA), yr_2006 = c(NA, 10000L, NA), yr_2007 = c(NA, 10000L, NA), yr_2008 = c(NA, 10000L, NA), yr_2009 = c(NA, 10000L, 10001L), yr_2010 = c(NA, 10000L, 10001L), yr_2011 = c(NA, 10000L, 10001L), yr_2012 = c(NA, 10000L, 10001L), yr_2013 = c(NA, 10000L, 10001L), yr_2014 = c(NA, 10000L, NA), yr_2015 = c(NA, 10000L, NA), yr_2016 = c(NA, 10000L, NA), yr_2017 = c(NA, 10000L, NA), yr_2018 = c(NA, 10000L, NA)), .Names = c("code", "yr_1986", "yr_1987", "yr_1988", "yr_1989", "yr_1990", "yr_1991", "yr_1992", "yr_1993", "yr_1994", "yr_1995", "yr_1996", "yr_1997", "yr_1998", "yr_1999", "yr_2000", "yr_2001", "yr_2002", "yr_2003", "yr_2004", "yr_2005", "yr_2006", "yr_2007", "yr_2008", "yr_2009", "yr_2010", "yr_2011", "yr_2012", "yr_2013", "yr_2014", "yr_2015", "yr_2016", "yr_2017", "yr_2018"), class = "data.frame", row.names = c(NA, -3L))

Question: I am trying to perform a conditional pairwise comparison across the columns of my dataframe to check the recurrence of the values stored in the first column, code, which are numeric codes. The remaining columns are a time series covering 1986-2018. Each year column records whether the code stored in the code column occurs in that year.

Now, to the crux of the problem. The objective is to create a new dataframe whose entries are populated through conditional statements based on the occurrence and disappearance of the values stored in the code column over time. The expected result is as follows:

Result:

df_out = structure(list(code = c(1L, 10000L), yr_1986 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_1987 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_1988 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_1989 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_1990 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_1991 = structure(c(2L, 1L), .Label = c("EXIST", "NEW"), class = "factor"), yr_1992 = structure(1:2, .Label = c("CLOSED", "EXIST"), class = "factor"), yr_1993 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_1994 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_1995 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_1996 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_1997 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_1998 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_1999 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2000 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2001 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2002 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2003 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2004 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2005 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2006 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2007 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2008 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2009 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2010 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2011 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2012 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2013 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2014 = structure(c(NA, 1L), 
.Label = "EXIST", class = "factor"), yr_2015 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2016 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2017 = structure(c(NA, 1L), .Label = "EXIST", class = "factor"), yr_2018 = structure(c(NA, 1L), .Label = "EXIST", class = "factor")), .Names = c("code", "yr_1986", "yr_1987", "yr_1988", "yr_1989", "yr_1990", "yr_1991", "yr_1992", "yr_1993", "yr_1994", "yr_1995", "yr_1996", "yr_1997", "yr_1998", "yr_1999", "yr_2000", "yr_2001", "yr_2002", "yr_2003", "yr_2004", "yr_2005", "yr_2006", "yr_2007", "yr_2008", "yr_2009", "yr_2010", "yr_2011", "yr_2012", "yr_2013", "yr_2014", "yr_2015", "yr_2016", "yr_2017", "yr_2018"), class = "data.frame", row.names = c(NA, -2L))

What follows is a brief description of the mechanics of what I intend to achieve. The first column, code, stores the codes of interest, one code per row. The remaining columns are year columns whose entries show the occurrence of that row's code over time.

The aim is to check the occurrence of each code in the code column over time (i.e. across the year columns) and recode the entries in the output as:

  • NEW for the first year(t) of occurrence;
  • CLOSED if the code stops recurring at year(t+1) after having occurred in year(t);
  • EXIST if the code keeps recurring for all years.
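The three rules above can be sketched for a single row as a small base R helper. This is only an illustration: the name classify_presence is hypothetical, NA is assumed to mean "code absent", the first year is assumed to have no predecessor, and years in which a partially present code simply continues are left as NA (the question does not say how to label them).

```r
# Hypothetical helper: classify one row of year values (NA = code absent).
classify_presence <- function(x) {
  present <- !is.na(x)
  # Presence in the previous year; the first year is treated as
  # "absent before" (an assumption, not stated in the question).
  prev <- c(FALSE, present[-length(present)])
  out <- rep(NA_character_, length(x))
  if (all(present)) {
    out[] <- "EXIST"                 # code occurs in every year
  } else {
    out[present & !prev] <- "NEW"    # occurs now, absent the year before
    out[!present & prev] <- "CLOSED" # absent now, occurred the year before
  }
  out
}

classify_presence(c(NA, NA, 1L, NA, NA))  # NA NA "NEW" "CLOSED" NA
classify_presence(c(5L, 5L, 5L))          # "EXIST" "EXIST" "EXIST"
```

Applied to the row for code 1 in the sample data, this would mark 1991 as NEW and 1992 as CLOSED, matching the expected result.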

I hope I have managed to describe the problem as clearly as possible.

EDIT: I have found a suboptimal way to solve the problem. I split the data into two types: 1) type 1 collects all the rows whose code shows up in only some years; 2) type 2 collects all the codes that recur every year of the period. What follows is the code and output based on the sample data I provided. But again, this is not optimal.

#Load packages
require(tidyverse)

#Select only the year columns in the input data
df_dat_year = df_dat %>%
select(-code)

#Select only the code column for later use
df_dat_code = df_dat %>%
select(code)

#Dataframe including all observations for code=1
df_dat1 = df_dat_year[1:1,]

#Dataframe including all observations for code=10000
df_dat2 = df_dat_year[2:2,]

#Create output dataframes
df_out1 = as.data.frame(matrix(nrow = nrow(df_dat1), ncol = ncol(df_dat1)))
df_out2 = as.data.frame(matrix(nrow = nrow(df_dat2), ncol = ncol(df_dat2)))

#Loop code for each output dataframe

##For output 1
#Stop at the second-to-last column because each step looks at column j+1
for(i in 1:nrow(df_dat1)) {
  for(j in seq_len(ncol(df_dat1) - 1)) {
    #lead() on a single value always returns NA, so compare with the next column directly
    if(!is.na(df_dat1[i,j]) && is.na(df_dat1[i,j+1])) {
      df_out1[i,j] = "new"
      df_out1[i,j+1] = "closed"
    }
  }
}
print(df_out1)

##For output 2
for(i in 1:nrow(df_dat2)) {
  for(j in 1:ncol(df_dat2)) {
    if(!is.na(df_dat2[i,j])) {
      df_out2[i,j] = "exists"
    }
  }
}
print(df_out2)

Once I have filled in the entries in the output, I join the dataframes with rbind(). Subsequently, I add the code column with cbind(). The final output is built as follows:

#Row-binding the output dataframes
df_out = rbind(df_out1,df_out2)

#Adding the code column to the final output dataframe
#(only the first two codes were processed above)
df_out_fin = cbind(df_dat_code[1:2, , drop = FALSE], df_out)

But again, this is a much messier and more convoluted way of solving the problem. Does anyone have a better solution that does not require the multitude of steps I have added?

Here's a tidyverse approach:

library(tidyverse)
df_dat %>% 
  pivot_longer(-code) %>%
  group_by(code) %>%
  mutate(value = case_when(
    sum(!is.na(value)) == n() ~ "exists",
    !is.na(value) & is.na(lag(value)) ~ "new",
    is.na(value) & !is.na(lag(value)) ~ "closed",
    TRUE ~ NA_character_
  )) %>%
  ungroup() %>%
  pivot_wider(names_from = name, values_from = value)

Result

# A tibble: 2 x 34
   code yr_1986 yr_1987 yr_1988 yr_1989 yr_1990 yr_1991 yr_1992 yr_1993 yr_1994 yr_1995 yr_1996 yr_1997 yr_1998 yr_1999 yr_2000 yr_2001 yr_2002 yr_2003 yr_2004 yr_2005 yr_2006 yr_2007 yr_2008 yr_2009 yr_2010 yr_2011 yr_2012 yr_2013 yr_2014 yr_2015 yr_2016 yr_2017 yr_2018
  <int> <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
1     1 NA      NA      NA      NA      NA      new     closed  NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA      NA     
2 10000 exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists  exists 

If it is done by row, pmap is also an option:

library(dplyr)
library(purrr)
pmap_dfr(df_dat[-1], ~ {
    tibble(v1 = c(...), v2 = lag(v1)) %>%
    transmute(out = case_when(all(!is.na(v1))  ~ 'EXISTS',
            !is.na(v1) & is.na(v2) ~ "NEW", 
            is.na(v1) & !is.na(v2) ~ "CLOSED")) %>%
    pull(out) %>% 
    set_names(names(df_dat)[-1])  }) %>%
    bind_cols(df_dat[1],.)
#   code yr_1986 yr_1987 yr_1988 yr_1989 yr_1990 yr_1991 yr_1992 yr_1993 yr_1994 yr_1995 yr_1996 yr_1997 yr_1998 yr_1999
#1     1    <NA>    <NA>    <NA>    <NA>    <NA>     NEW  CLOSED    <NA>    <NA>    <NA>    <NA>    <NA>    <NA>    <NA>
#2 10000  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS
#  yr_2000 yr_2001 yr_2002 yr_2003 yr_2004 yr_2005 yr_2006 yr_2007 yr_2008 yr_2009 yr_2010 yr_2011 yr_2012 yr_2013 yr_2014
#1    <NA>    <NA>    <NA>    <NA>    <NA>    <NA>    <NA>    <NA>    <NA>    <NA>    <NA>    <NA>    <NA>    <NA>    <NA>
#2  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS  EXISTS
#  yr_2015 yr_2016 yr_2017 yr_2018
#1    <NA>    <NA>    <NA>    <NA>
#2  EXISTS  EXISTS  EXISTS  EXISTS

A simple solution would be to create vectors storing the preceding and following values of your time series:

x <- !is.na(df_dat[1,-1])
x.prec <- c(NA, x[-length(x)])
x.foll <- c(x[-1], NA)

Then you can find, e.g., all NEW flags (values set, but with unset predecessor) with

x.new <- x & !x.prec

The same works for CLOSED (last set value) and so on.
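As a minimal sketch of how the remaining flags could be derived from the shifted vectors, the example below uses a made-up logical vector in place of one row of year columns. The names x.last and x.closed are assumptions for illustration: x.last marks the last year in which the value is set, while x.closed marks the first unset year after it, which is where the question places CLOSED.

```r
# Toy presence vector standing in for one row of year columns (assumption)
x <- c(FALSE, FALSE, TRUE, TRUE, FALSE, FALSE)
x.prec <- c(NA, x[-length(x)])   # value one year earlier
x.foll <- c(x[-1], NA)           # value one year later

x.new    <- x & !x.prec          # set now, unset the year before
x.last   <- x & !x.foll          # set now, unset the year after
x.closed <- !x & x.prec          # unset now, set the year before

which(x.new)     # 3
which(x.last)    # 4
which(x.closed)  # 5
```

The NA padding at either end propagates into the comparisons, but which() skips NA, so the boundary years are simply never flagged by a test that needs a missing neighbour.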

Here is a base R approach:

mat = is.na(df_dat[, -1L])
res = matrix(NA_character_, ncol = ncol(mat), nrow = nrow(mat))

#code = 1 logic:
x = mat[1L, ]
ind_new = which(!x & c(x[-1L], FALSE))
ind_closed = ind_new + 1L
res[1L, c(ind_new, ind_closed)] = rep(c("new", 'closed'), each = length(ind_new))

#code = 10000 logic:
x = mat[2L, ]
res[2L, !x] = "exists"

res
cbind(df_dat[1L], res)

Basically, we are using is.na(df_dat[, -1L]) to evaluate your logic.
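The same idea can probably be applied to every row at once, without writing separate logic per code, by shifting the whole presence matrix one column to the right. The sketch below uses a small made-up dataframe (toy) in the same shape as df_dat; the names toy, present and prev are assumptions for illustration, and the first year is treated as having no predecessor.

```r
# Toy input in the same shape as df_dat (assumption: NA marks absence)
toy <- data.frame(code  = c(1L, 2L),
                  yr_01 = c(NA, 2L),
                  yr_02 = c(1L, 2L),
                  yr_03 = c(NA, 2L))

present <- !is.na(as.matrix(toy[, -1L]))
# Presence shifted one column to the right: the state in the previous year
prev <- cbind(FALSE, present[, -ncol(present), drop = FALSE])

res <- matrix(NA_character_, nrow = nrow(present), ncol = ncol(present),
              dimnames = dimnames(present))
res[present & !prev] <- "new"        # present now, absent the year before
res[!present & prev] <- "closed"     # absent now, present the year before
# Rows present in every year override the flags set above
res[rowSums(present) == ncol(present), ] <- "exists"

cbind(toy[1L], res)
```

The "exists" assignment comes last on purpose: a code present in every year would otherwise keep a "new" flag in its first year.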
