简体   繁体   English

使用 R 从列中提取序列

[英]Extracting sequences from columns using R

I have a df which looks like this我有一个看起来像这样的df

ID  X003-APP    X005-APP    X008-APP    X003-COP    X004-COP    X008-PIN    X009-PIN
363 NA          NA          1           0           NA          4          5
364 0           2           NA          1           5           1          5
678 0           NA          NA          5           NA          NA         NA
713 1           1           1           1           1           1          1
219 1           2           3           3           NA          4          5
234 NA          NA          NA          2           3           NA         NA
321 2           3           1           NA          NA          1          2

I am interested in minimum counts for non-null values across the column substrings APP , COP and PIN .我对跨列子字符串APPCOPPIN的非空值的最小计数感兴趣。 My required output is:我需要的输出是:

ID     APP  COP PIN
363    1    1   1
364    1    1   1
678    1    1   0
713    1    1   1
219    1    1   1
234    0    1   0
321    1    0   1

For reference, I am sharing the dput():作为参考,我正在分享 dput():

structure(list(ID = c(363L, 364L, 678L, 713L, 219L, 234L, 321L), 
X003.APP = c(NA, 0L, 0L, 1L, 1L, NA, 2L), 
X005.APP = c(NA, 2L, NA, 1L, 2L, NA, 3L), 
X008.APP = c(1L, NA, NA, 1L, 3L, NA, 1L), 
X003.COP = c(0L, 1L, 5L, 1L, 3L, 2L, NA), 
X004.COP = c(NA, 5L, NA, 1L, NA, 3L, NA), 
X008.PIN = c(4L, 1L, NA, 1L, 4L, NA, 1L), 
X009.PIN = c(5L, 5L, NA, 1L, 5L, NA, 2L)), 
class = "data.frame", row.names = c(NA, -7L))

Edit:编辑:

Later on, I would like to analyse 2 and 3 sequences across IDs.稍后,我想分析跨 ID 的 2 个和 3 个序列。 For example, I am ultimately, interested in minimum counts for non-null values across the column substrings APP , COP and PIN .例如,我最终对跨列子字符串APPCOPPIN的非空值的最小计数感兴趣。 My ultimate required output for a sequence of length 2 would be:对于长度为 2 的序列,我最终需要的输出是:

Spec_1  Spec_2  Counts
APP     COP     5
APP     PIN     5
COP     PIN     4

Or correspondingly, my required output for a sequence of length 3 would be:或者相应地,我需要的长度为 3 的序列的输出将是:

Spec_1  Spec_2  Spec_3 Counts
APP     COP     PIN    4

Is there an easy way to achieve this?有没有简单的方法来实现这一目标? It would be great to have a solution that could cater for longer sequences - even beyond 3. Thank you very much for your time.如果有一个解决方案可以满足更长的序列 - 甚至超过 3 个序列,那就太好了。非常感谢您抽出宝贵的时间。

You may try你可以试试

library(reshape2)
library(tidyverse)

df %>%
  reshape2::melt(id = "ID") %>%
  separate(variable, into = c("a", "Spec"), sep = "\\.") %>%
  group_by(ID, Spec) %>%
  summarize(value = as.numeric(any(!is.na(value)))) %>%
  filter(value == 1) %>%
  pivot_wider(names_from = "Spec", values_from = "value") %>%
  replace(is.na(.), 0)

     ID   APP   COP   PIN
  <int> <dbl> <dbl> <dbl>
1   219     1     1     1
2   234     0     1     0
3   321     1     0     1
4   363     1     1     1
5   364     1     1     1
6   678     1     1     0
7   713     1     1     1

Is your edited one and是您编辑过的吗?

df %>%
  reshape2::melt(id = "ID") %>%
  separate(variable, into = c("a", "Spec"), sep = "\\.") %>%
  group_by(ID, Spec) %>%
  summarize(value = any(!is.na(value))) %>%
  filter(value) %>%
  group_by(ID) %>%
  filter(n() > 1) %>%
  summarise(Spec = combn(Spec, 2, simplify = F)) %>%
  unnest_wider(Spec, names_sep = "_") %>%
  group_by(Spec_1, Spec_2) %>%
  summarize(Counts = n())

  Spec_1 Spec_2 Counts
  <chr>  <chr>   <int>
1 APP    COP         5
2 APP    PIN         5
3 COP    PIN         4

is your previous one.是你的前一个。

3 seq? 3序?

df %>%
  reshape2::melt(id = "ID") %>%
  separate(variable, into = c("a", "Spec"), sep = "\\.") %>%
  group_by(ID, Spec) %>%
  summarize(value = any(!is.na(value))) %>%
  filter(value) %>%
  group_by(ID) %>%
  filter(n() > 2) %>%
  summarise(Spec = combn(Spec, 3, simplify = F)) %>%
  unnest_wider(Spec, names_sep = "_") %>%
  group_by(Spec_1, Spec_2, Spec_3) %>%
  summarize(Counts = n())

  Spec_1 Spec_2 Spec_3 Counts
  <chr>  <chr>  <chr>   <int>
1 APP    COP    PIN         4

Try this using dplyr尝试使用dplyr

library(dplyr)

df |> rowwise() |> transmute( ID,
APP = case_when(all(is.na(c_across(contains("APP")))) ~ 0 , TRUE ~ 1) ,
COP = case_when(all(is.na(c_across(contains("COP")))) ~ 0 , TRUE ~ 1) , 
PIN = case_when(all(is.na(c_across(contains("PIN")))) ~ 0 , TRUE ~ 1)) -> df1
  • output输出
# A tibble: 7 × 4
# Rowwise: 
     ID   APP   COP   PIN
  <int> <dbl> <dbl> <dbl>
1   363     1     1     1
2   364     1     1     1
3   678     1     1     0
4   713     1     1     1
5   219     1     1     1
6   234     0     1     0
7   321     1     0     1
  • for your second required you can use对于您第二次需要,您可以使用
df1 |> transmute(AC = case_when(sum(c_across(c(APP,COP))) == 2 ~ 1 , TRUE ~ 0) ,
AP = case_when(sum(c_across(c(APP,PIN))) == 2 ~ 1 , TRUE ~ 0) ,
CP = case_when(sum(c_across(c(PIN,COP))) == 2 ~ 1 , TRUE ~ 0) ,
ACP = case_when(sum(c_across(c(APP,COP,PIN))) == 3 ~ 1 , TRUE ~ 0)) |> ungroup() |> 
summarise(APP_COP = sum(AC) , APP_PIN = sum(AP) , COP_PIN = sum(CP) ,  APP_COP_PIN = sum(ACP))
  • output输出
# A tibble: 1 × 4
  APP_COP APP_PIN COP_PIN APP_COP_PIN
    <dbl>   <dbl>   <dbl>       <dbl>
1       5       5       4           4

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM