简体   繁体   中英

Keep rows with certain values within a data frame and delete all others [R]

I am using R

set.seed(1)
Data <- data.frame(id = seq(1, 10), 
               Diag1 = sample(c("A123", "B123", "C123"), 10, replace = TRUE), 
               Diag2 = sample(c("D123", "E123", "F123"), 10, replace = TRUE), 
               Diag3 = sample(c("G123", "H123", "I123"), 10, replace = TRUE), 
               Diag4 = sample(c("A123", "B123", "C123"), 10, replace = TRUE), 
               Diag5 = sample(c("J123", "K123", "L123"), 10, replace = TRUE), 
               Diag6 = sample(c("M123", "N123", "O123"), 10, replace = TRUE), 
               Diag7 = sample(c("P123", "Q123", "R123"), 10, replace = TRUE))
Data

I've got a data frame like this. In reality it has 34 variables and 1.5 Mio observations. It is a data frame with patient data. (ID & diagnoses (ICD10) A123 and B123 stand for certain diagnoses. I want to extract all the patients with these diagnoses. In fact i am looking for 6 diagnoses within 100s of different ICD10 diagnoses. Every of those diagnoses i look for can be appear in any column but they are mutually exclusive. In the end I will have a data frame of about 4000 observations instead of 1.5 Mio.

My goal is to get a data frame where I just keep the rows which contain A123 or B123. A123 and B123 cannot be in the same row. But they can appear in every column.

I manage to do that for one single variable when i do this:

DataA123 <- Data[Data$Diag1 == "A123", ]

But i want to do it for every variable and for A123 and B123 (there are actually 6 factors like this) together.

Is this possible?

How about this?

Select all rows with A123 and/or B123:

Data[apply(Data,1,function(x) {any(c("A123", "B123") %in% x)}),]

Select all rows with either A123 or B123:

Data[apply(Data,1,function(x) {Reduce(xor, c("A123", "B123") %in% x)}),]

If I understand your question correctly, you might be able to use something like:

Data[rowSums(cbind(rowSums(Data == "A123"), 
                   rowSums(Data == "B123")) != 0) == 1, ]

(But I'm not sure how efficient it would be for your actual data, especially because you have to make several intermediate large matrices).


The basic idea is as follows:

  • rowSums(Data == "A123") tells us how many times "A123" appears in each row.
  • rowSums(Data == "B123") tells us how many times "B123" appears in each row.
  • cbind puts the two of them together as a two column matrix.
  • Since "A123" and "B123" can't be in the same row, we calculate rowSums again to find out how many rows have only one of those present (even if it is present more than once).
  • From there, it's basic subsetting.

Here's an example:

set.seed(1)
Data <- data.frame(id = seq(1, 10), 
               Diag1 = sample(c("A123", "B123", "C123"), 10, replace = TRUE), 
               Diag2 = sample(c("D123", "E123", "F123"), 10, replace = TRUE), 
               Diag3 = sample(c("G123", "H123", "I123"), 10, replace = TRUE), 
               Diag4 = sample(c("A123", "B123", "C123"), 10, replace = TRUE), 
               Diag5 = sample(c("J123", "K123", "L123"), 10, replace = TRUE), 
               Diag6 = sample(c("M123", "N123", "O123"), 10, replace = TRUE), 
               Diag7 = sample(c("P123", "Q123", "R123"), 10, replace = TRUE))
Data
#    id Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
# 1   1  A123  D123  I123  B123  L123  N123  R123
# 2   2  B123  D123  G123  B123  K123  O123  P123
# 3   3  B123  F123  H123  B123  L123  N123  Q123
# 4   4  C123  E123  G123  A123  K123  M123  P123
# 5   5  A123  F123  G123  C123  K123  M123  Q123
# 6   6  C123  E123  H123  C123  L123  M123  P123
# 7   7  C123  F123  G123  C123  J123  M123  Q123
# 8   8  B123  F123  H123  A123  K123  N123  R123
# 9   9  B123  E123  I123  C123  L123  N123  P123
# 10 10  A123  F123  H123  B123  L123  N123  R123

Data[rowSums(cbind(rowSums(Data == "A123"), 
                   rowSums(Data == "B123")) != 0) == 1, ]
#   id Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
# 2  2  B123  D123  G123  B123  K123  O123  P123
# 3  3  B123  F123  H123  B123  L123  N123  Q123
# 4  4  C123  E123  G123  A123  K123  M123  P123
# 5  5  A123  F123  G123  C123  K123  M123  Q123
# 9  9  B123  E123  I123  C123  L123  N123  P123

Note that from the source 10-row data.frame

  • rows 1, 8, and 10 were dropped because they contained both "A123" and "B123".
  • rows 6 and 7 were dropped because they contained neither "A123" nor "B123".
set.seed(1)

  ll<-as.list(names(Data)[-1])

For A123:

Map(function(x) Data[Data[x][[1]]=="A123",],ll)


  [[1]]
   id Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
1   1  A123  D123  I123  B123  L123  N123  R123
5   5  A123  F123  G123  C123  K123  M123  Q123
10 10  A123  F123  H123  B123  L123  N123  R123

[[2]]
[1] id    Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)

[[3]]
[1] id    Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)

[[4]]
  id Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
4  4  C123  E123  G123  A123  K123  M123  P123
8  8  B123  F123  H123  A123  K123  N123  R123

[[5]]
[1] id    Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)

[[6]]
[1] id    Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)

[[7]]
[1] id    Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)

For B123:

Map(function(x) Data[Data[x][[1]]=="B123",],ll)



 [[1]]
  id Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
2  2  B123  D123  G123  B123  K123  O123  P123
3  3  B123  F123  H123  B123  L123  N123  Q123
8  8  B123  F123  H123  A123  K123  N123  R123
9  9  B123  E123  I123  C123  L123  N123  P123

[[2]]
[1] id    Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)

[[3]]
[1] id    Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)

[[4]]
   id Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
1   1  A123  D123  I123  B123  L123  N123  R123
2   2  B123  D123  G123  B123  K123  O123  P123
3   3  B123  F123  H123  B123  L123  N123  Q123
10 10  A123  F123  H123  B123  L123  N123  R123

[[5]]
[1] id    Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)

[[6]]
[1] id    Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)

[[7]]
[1] id    Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)

For A123 or B123:

 Map(function(x) Data[Data[x][[1]]=="A123"|Data[x][[1]]=="B123",],ll)



 [[1]]
   id Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
1   1  A123  D123  I123  B123  L123  N123  R123
2   2  B123  D123  G123  B123  K123  O123  P123
3   3  B123  F123  H123  B123  L123  N123  Q123
5   5  A123  F123  G123  C123  K123  M123  Q123
8   8  B123  F123  H123  A123  K123  N123  R123
9   9  B123  E123  I123  C123  L123  N123  P123
10 10  A123  F123  H123  B123  L123  N123  R123

[[2]]
[1] id    Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)

[[3]]
[1] id    Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)

[[4]]
   id Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
1   1  A123  D123  I123  B123  L123  N123  R123
2   2  B123  D123  G123  B123  K123  O123  P123
3   3  B123  F123  H123  B123  L123  N123  Q123
4   4  C123  E123  G123  A123  K123  M123  P123
8   8  B123  F123  H123  A123  K123  N123  R123
10 10  A123  F123  H123  B123  L123  N123  R123

[[5]]
[1] id    Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)

[[6]]
[1] id    Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)

[[7]]
[1] id    Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM