Finding frequencies of all possible pairs in R
I'm working with a large dataset of drugs and reactions using R. For now, I have the data structured as a very tall data frame that lists the report ID number, the drug name, and the reported reactions. As you can tell, there is a one-to-many relationship both between IDs and drugs and between drugs and reactions.
Keeping in mind that this dataset is MUCH larger than what I can duplicate here, I'd like to know how to find which pairs of drugs lead to which reactions, and with what frequency.
Most importantly, I am interested in how to approach a problem like this. Is the data structured correctly? What concepts or libraries should I read about?
Here's a link to some real data: https://www.dropbox.com/s/kzx4mpyytbo9zil/query_result.csv
ID DRUG REACTION
1 1827 ASPIRIN CHEST PAIN
2 1827 CLARINEX CHEST PAIN
3 1827 ASPIRIN COUGH
4 1827 CLARINEX COUGH
5 1827 ASPIRIN HAEMOGLOBIN DECREASED
6 1827 CLARINEX HAEMOGLOBIN DECREASED
7 1827 ASPIRIN NEUTROPHIL COUNT INCREASED
8 1827 CLARINEX NEUTROPHIL COUNT INCREASED
9 1827 ASPIRIN PHARYNGOLARYNGEAL PAIN
10 1827 CLARINEX PHARYNGOLARYNGEAL PAIN
...
In my teeny little brain, the end result looks something like this...
Drug1 Drug2 Reaction Frequency
1 tylenol alcohol hepatic failure 298
2 advil aleve bleeding 201
3 aspirin advil renal failure 199
4 docusate senna diarrhea 146
5 senna sudafed palpitations 121
6 xanax alcohol sedation 111
7 clarinex benadryl dry mouth 96
...
569 ASPIRIN CLARINEX CHEST PAIN 2
Drug1 and Drug2 are the drug pairs with the highest frequency from the entire dataset. A "drug pair" is defined as any combination of two drugs with the same report ID. The example output above would be interpreted as, "row 1 had 298 unique report IDs for which hepatic failure was the reaction."
OK, I'll try an answer; I hope I understood the question correctly. The code is intended to give some ideas rather than to be elegant or final.
Please note: I intentionally used for loops instead of possible vectorisation / apply functions, to make it easier to understand (those who are familiar with apply functions will also understand the for loop ;-)).
Please note 2: Since I only have a tiny piece of the data, I could not test the code on the whole dataset!
EDIT: columns based on the example above - possibly different from the csv data.
Key points are:
unique, [ etc.
utils::combn to get combinations
Hope that helps!
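As a quick illustration of what utils::combn returns, here is a minimal sketch on a made-up drug list for a single report. The third drug name is only an assumption for the example and does not come from the sample data.

```r
# Made-up drug list for one hypothetical report (illustration only)
drugs <- c("CLARINEX", "ASPIRIN", "TYLENOL", "ASPIRIN")

# Sort the distinct names first so that a pair always appears in the
# same order (ASPIRIN-CLARINEX, never CLARINEX-ASPIRIN)
distinct.drugs <- sort(unique(drugs))

# combn() returns a matrix with one combination per column:
# row 1 holds the first drug of each pair, row 2 the second
pairs <- utils::combn(x = distinct.drugs, m = 2)
pairs
```

Three distinct drugs give three columns, one per unordered pair. Note that combn stops with an error when fewer than two distinct drugs are passed, so reports with a single drug need to be skipped before calling it.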
require(utils)
df <- read.table(header=TRUE, stringsAsFactors=FALSE, # keep DRUG and REACTION as character, not factor
text="LINE ID DRUG REACTION
1 1827 ASPIRIN CHEST_PAIN
2 1827 CLARINEX CHEST_PAIN
3 1827 ASPIRIN COUGH
4 1827 CLARINEX COUGH
5 1827 ASPIRIN HAEMOGLOBIN_DECREASED
6 1827 CLARINEX HAEMOGLOBIN_DECREASED
7 1827 ASPIRIN NEUTROPHIL_COUNT_INCREASED
8 1827 CLARINEX NEUTROPHIL_COUNT_INCREASED
9 1827 ASPIRIN PHARYNGOLARYNGEAL_PAIN
10 1827 CLARINEX PHARYNGOLARYNGEAL_PAIN")
# temporary object to collect if a combination is present
Results <- data.frame(Drug1=NA, Drug2=NA, Reaction=NA, Reaction.occurs=NA)
n=1 # start first line in Results object
# walk through each ID ...
for (ID in unique(df$ID)) {
  # ... and each possible pair of drugs within a (report) ID ...
  drugs <- sort(unique(df[df$ID == ID, "DRUG"])) # sorted, so A-B and B-A count as the same pair
  if (length(drugs) < 2) next                    # a report with a single drug has no pairs
  drug.pairs <- utils::combn(x=drugs, m=2)       # one pair per column
  for (ii in seq_len(ncol(drug.pairs))) {
    # ... and each reaction ...
    for (reaction in unique(df$REACTION)) {
      # drugs listed together with this reaction in this report
      drugs.with.reaction <- df[df$REACTION == reaction & df$ID == ID, "DRUG"]
      Results[n, "Drug1"] <- drug.pairs[1,ii]
      Results[n, "Drug2"] <- drug.pairs[2,ii]
      Results[n, "Reaction"] <- reaction
      Results[n, "Reaction.occurs"] <- drug.pairs[1,ii] %in% drugs.with.reaction &
                                       drug.pairs[2,ii] %in% drugs.with.reaction
      n <- n+1
    }
  }
}
head(Results)
# then find the unique Drug1 - Drug2 -Reaction combinations, and count the TRUE values:
(Results[!duplicated(Results[,1:3]), ][,1:3])
(unique(Results[, 1:3]))
# Results2 contains only the unique combinations
Results2 <- Results[!duplicated(Results[,1:3]), ][,1:3]
# calculate the frequencies: number of reports in which the pair occurs with the reaction
for (i in seq_len(nrow(Results2))) {
  Results2[i, "Frequency"] <- sum(Results[Results$Drug1 == Results2[i, "Drug1"] &
                                          Results$Drug2 == Results2[i, "Drug2"] &
                                          Results$Reaction == Results2[i, "Reaction"], ]$Reaction.occurs)
}
Results2
# --- end ----
gives:
Drug1 Drug2 Reaction Frequency
1 ASPIRIN CLARINEX CHEST_PAIN 1
2 ASPIRIN CLARINEX COUGH 1
3 ASPIRIN CLARINEX HAEMOGLOBIN_DECREASED 1
4 ASPIRIN CLARINEX NEUTROPHIL_COUNT_INCREASED 1
5 ASPIRIN CLARINEX PHARYNGOLARYNGEAL_PAIN 1
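For a dataset that is much larger than this sample, the nested loops will likely be slow. One possible vectorised route, sketched here with base R only, is to join the data frame to itself on ID and REACTION and then count rows. This is a sketch under assumptions: it uses the column names from the example above (the csv may differ), it rebuilds the same ten sample rows so that it runs on its own, and it has only been reasoned through on that sample, not on the full data.

```r
# Same ten sample rows as above, rebuilt so this block is self-contained
df <- data.frame(
  ID       = 1827,
  DRUG     = rep(c("ASPIRIN", "CLARINEX"), times = 5),
  REACTION = rep(c("CHEST_PAIN", "COUGH", "HAEMOGLOBIN_DECREASED",
                   "NEUTROPHIL_COUNT_INCREASED", "PHARYNGOLARYNGEAL_PAIN"),
                 each = 2),
  stringsAsFactors = FALSE
)

# Drop repeated rows so each ID / drug / reaction triple is counted once
df <- unique(df[, c("ID", "DRUG", "REACTION")])

# Self-join: all combinations of two drugs that share a report ID and a reaction
pairs <- merge(df, df, by = c("ID", "REACTION"), suffixes = c("1", "2"))

# Keep each unordered pair once (A-B, but not B-A and not A-A)
pairs <- pairs[pairs$DRUG1 < pairs$DRUG2, ]

# Count the report IDs per Drug1 / Drug2 / Reaction combination
freq <- aggregate(ID ~ DRUG1 + DRUG2 + REACTION, data = pairs, FUN = length)
names(freq) <- c("Drug1", "Drug2", "Reaction", "Frequency")

# Most frequent combinations first
freq <- freq[order(-freq$Frequency), ]
freq
```

On the sample this yields the same five ASPIRIN / CLARINEX rows with a frequency of 1 each. The self-join can itself become large when single reports list many drugs, so for very big data a package built for grouped operations may be worth reading about as well.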