I am using R
set.seed(1)
Data <- data.frame(id = seq(1, 10),
Diag1 = sample(c("A123", "B123", "C123"), 10, replace = TRUE),
Diag2 = sample(c("D123", "E123", "F123"), 10, replace = TRUE),
Diag3 = sample(c("G123", "H123", "I123"), 10, replace = TRUE),
Diag4 = sample(c("A123", "B123", "C123"), 10, replace = TRUE),
Diag5 = sample(c("J123", "K123", "L123"), 10, replace = TRUE),
Diag6 = sample(c("M123", "N123", "O123"), 10, replace = TRUE),
Diag7 = sample(c("P123", "Q123", "R123"), 10, replace = TRUE))
Data
I've got a data frame like this. In reality it has 34 variables and 1.5 million observations. It is a data frame with patient data (ID and ICD-10 diagnoses); A123 and B123 stand for certain diagnoses. I want to extract all the patients with these diagnoses. In fact, I am looking for 6 diagnoses among hundreds of different ICD-10 diagnoses. Each of the diagnoses I am looking for can appear in any column, but they are mutually exclusive. In the end I will have a data frame of about 4000 observations instead of 1.5 million.
My goal is to get a data frame where I keep only the rows that contain A123 or B123. A123 and B123 cannot be in the same row, but they can appear in any column.
I can do that for a single variable like this:
DataA123 <- Data[Data$Diag1 == "A123", ]
But I want to do it for every variable, and for A123 and B123 together (there are actually 6 diagnoses like this).
Is this possible?
How about this?
Select all rows with A123 and/or B123:
Data[apply(Data,1,function(x) {any(c("A123", "B123") %in% x)}),]
Select all rows with either A123 or B123, but not both:
Data[apply(Data,1,function(x) {Reduce(xor, c("A123", "B123") %in% x)}),]
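One caveat when extending this to all six codes: Reduce(xor, ...) over more than two values is TRUE whenever an odd number of them is TRUE, which is not the same as "exactly one". A minimal sketch of an alternative that counts the distinct codes per row is below. The data frame toy, the vector codes, and the "^Diag" column-name pattern are assumptions made for illustration, not the poster's real data.

```r
# Small hand-made data frame (hypothetical values, not the poster's data)
toy <- data.frame(id    = 1:4,
                  Diag1 = c("A123", "B123", "C123", "A123"),
                  Diag2 = c("D123", "A123", "E123", "A123"),
                  stringsAsFactors = FALSE)

codes <- c("A123", "B123")   # extend this vector to all six codes of interest

# Look only at the diagnosis columns, so the id column is never compared
diag_cols <- grep("^Diag", names(toy), value = TRUE)

# TRUE where exactly one distinct code from `codes` occurs in the row
exactly_one <- apply(toy[diag_cols], 1, function(x) sum(codes %in% x) == 1)

toy[exactly_one, ]
```

Row 2 is dropped because it holds both codes, row 3 because it holds neither; row 4 is kept because A123 appears twice but is still only one distinct code.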
If I understand your question correctly, you might be able to use something like:
Data[rowSums(cbind(rowSums(Data == "A123"),
rowSums(Data == "B123")) != 0) == 1, ]
(But I'm not sure how efficient it would be for your actual data, especially because you have to make several intermediate large matrices).
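If the intermediate matrices turn out to be a problem, one possible workaround is to go column by column so that only logical vectors of length nrow are held at a time. This is only a sketch under assumptions: toy, codes, and the "^Diag" column-name pattern are made up for illustration, and it has not been timed on data of the size described in the question.

```r
# Small hand-made data frame (hypothetical values, not the poster's data)
toy <- data.frame(id    = 1:4,
                  Diag1 = c("A123", "B123", "C123", "B123"),
                  Diag2 = c("D123", "A123", NA,     "B123"),
                  stringsAsFactors = FALSE)

codes     <- c("A123", "B123")   # extend to all six codes of interest
diag_cols <- grep("^Diag", names(toy), value = TRUE)

# For each code, OR together one logical vector per diagnosis column.
# %in% is used instead of == so that NA entries count as "not present".
has_code <- vapply(codes, function(code) {
  Reduce(`|`, lapply(toy[diag_cols], function(col) col %in% code))
}, logical(nrow(toy)))

# Keep rows in which exactly one of the codes is present
keep <- rowSums(has_code) == 1

toy[keep, ]
```

has_code is a logical matrix with one row per patient and one column per code, so it stays small (six columns for six codes) regardless of how many diagnosis columns there are.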
The basic idea is as follows:
rowSums(Data == "A123") tells us how many times "A123" appears in each row.
rowSums(Data == "B123") tells us how many times "B123" appears in each row.
cbind puts the two of them together as a two-column matrix.
!= 0 turns those counts into TRUE/FALSE presence flags, and rowSums is applied again to find the rows that have only one of the two codes present (even if it is present more than once).
Here's an example:
set.seed(1)
Data <- data.frame(id = seq(1, 10),
Diag1 = sample(c("A123", "B123", "C123"), 10, replace = TRUE),
Diag2 = sample(c("D123", "E123", "F123"), 10, replace = TRUE),
Diag3 = sample(c("G123", "H123", "I123"), 10, replace = TRUE),
Diag4 = sample(c("A123", "B123", "C123"), 10, replace = TRUE),
Diag5 = sample(c("J123", "K123", "L123"), 10, replace = TRUE),
Diag6 = sample(c("M123", "N123", "O123"), 10, replace = TRUE),
Diag7 = sample(c("P123", "Q123", "R123"), 10, replace = TRUE))
Data
# id Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
# 1 1 A123 D123 I123 B123 L123 N123 R123
# 2 2 B123 D123 G123 B123 K123 O123 P123
# 3 3 B123 F123 H123 B123 L123 N123 Q123
# 4 4 C123 E123 G123 A123 K123 M123 P123
# 5 5 A123 F123 G123 C123 K123 M123 Q123
# 6 6 C123 E123 H123 C123 L123 M123 P123
# 7 7 C123 F123 G123 C123 J123 M123 Q123
# 8 8 B123 F123 H123 A123 K123 N123 R123
# 9 9 B123 E123 I123 C123 L123 N123 P123
# 10 10 A123 F123 H123 B123 L123 N123 R123
Data[rowSums(cbind(rowSums(Data == "A123"),
rowSums(Data == "B123")) != 0) == 1, ]
# id Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
# 2 2 B123 D123 G123 B123 K123 O123 P123
# 3 3 B123 F123 H123 B123 L123 N123 Q123
# 4 4 C123 E123 G123 A123 K123 M123 P123
# 5 5 A123 F123 G123 C123 K123 M123 Q123
# 9 9 B123 E123 I123 C123 L123 N123 P123
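Note that from the source 10-row data.frame, rows 1, 8, and 10 were dropped because they contain both "A123" and "B123", and rows 6 and 7 were dropped because they contain neither.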
set.seed(1)
ll<-as.list(names(Data)[-1])
For A123:
Map(function(x) Data[Data[x][[1]]=="A123",],ll)
[[1]]
id Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
1 1 A123 D123 I123 B123 L123 N123 R123
5 5 A123 F123 G123 C123 K123 M123 Q123
10 10 A123 F123 H123 B123 L123 N123 R123
[[2]]
[1] id Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)
[[3]]
[1] id Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)
[[4]]
id Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
4 4 C123 E123 G123 A123 K123 M123 P123
8 8 B123 F123 H123 A123 K123 N123 R123
[[5]]
[1] id Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)
[[6]]
[1] id Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)
[[7]]
[1] id Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)
For B123:
Map(function(x) Data[Data[x][[1]]=="B123",],ll)
[[1]]
id Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
2 2 B123 D123 G123 B123 K123 O123 P123
3 3 B123 F123 H123 B123 L123 N123 Q123
8 8 B123 F123 H123 A123 K123 N123 R123
9 9 B123 E123 I123 C123 L123 N123 P123
[[2]]
[1] id Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)
[[3]]
[1] id Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)
[[4]]
id Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
1 1 A123 D123 I123 B123 L123 N123 R123
2 2 B123 D123 G123 B123 K123 O123 P123
3 3 B123 F123 H123 B123 L123 N123 Q123
10 10 A123 F123 H123 B123 L123 N123 R123
[[5]]
[1] id Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)
[[6]]
[1] id Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)
[[7]]
[1] id Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)
For A123 or B123:
Map(function(x) Data[Data[x][[1]]=="A123"|Data[x][[1]]=="B123",],ll)
[[1]]
id Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
1 1 A123 D123 I123 B123 L123 N123 R123
2 2 B123 D123 G123 B123 K123 O123 P123
3 3 B123 F123 H123 B123 L123 N123 Q123
5 5 A123 F123 G123 C123 K123 M123 Q123
8 8 B123 F123 H123 A123 K123 N123 R123
9 9 B123 E123 I123 C123 L123 N123 P123
10 10 A123 F123 H123 B123 L123 N123 R123
[[2]]
[1] id Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)
[[3]]
[1] id Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)
[[4]]
id Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
1 1 A123 D123 I123 B123 L123 N123 R123
2 2 B123 D123 G123 B123 K123 O123 P123
3 3 B123 F123 H123 B123 L123 N123 Q123
4 4 C123 E123 G123 A123 K123 M123 P123
8 8 B123 F123 H123 A123 K123 N123 R123
10 10 A123 F123 H123 B123 L123 N123 R123
[[5]]
[1] id Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)
[[6]]
[1] id Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)
[[7]]
[1] id Diag1 Diag2 Diag3 Diag4 Diag5 Diag6 Diag7
<0 rows> (or 0-length row.names)
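The Map calls above return one data frame per diagnosis column, and a patient can show up in more than one of them. If a single data frame is wanted, the pieces could be stacked and de-duplicated roughly as follows. This is a sketch with made-up data (toy is hypothetical) and it assumes id uniquely identifies a row. It keeps rows with A123 and/or B123, so it does not enforce the "not both" condition.

```r
# Small hand-made data frame (hypothetical values, not the poster's data)
toy <- data.frame(id    = 1:4,
                  Diag1 = c("A123", "C123", "B123", "C123"),
                  Diag2 = c("D123", "A123", "B123", "E123"),
                  stringsAsFactors = FALSE)

cols <- as.list(names(toy)[-1])   # every column except id

# One data frame per diagnosis column, as in the Map calls above
per_column <- Map(function(x) toy[toy[[x]] %in% c("A123", "B123"), ], cols)

# Stack the pieces, then drop patients that matched in several columns
combined <- do.call(rbind, per_column)
combined <- combined[!duplicated(combined$id), ]
combined <- combined[order(combined$id), ]

combined
```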