How do I remove unique rows based on certain column and row information in a data frame in R?
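I have a data frame containing chromosome coordinates, peak intersections, and other related information, with 127,471 rows and 11 columns (abbreviated table below).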
UnionChr | UnionStart | UnionEnd | IntersectChr | IntersectStart | IntersectEnd | IntersectName | Overlap | Genotype |
---|---|---|---|---|---|---|---|---|
chr1 | 3657227 | 3658092 | . | -1 | -1 | . | 0 | WT |
chr1 | 3657227 | 3658092 | chr1 | 3657227 | 3658092 | dko_k27_peak_1 | 865 | DKO |
chr1 | 3658443 | 3664519 | chr1 | 3658443 | 3662838 | wt_k27_peak_1 | 4395 | WT |
chr1 | 3658443 | 3664519 | chr1 | 3663340 | 3664519 | wt_k27_peak_2 | 1179 | WT |
chr1 | 3658443 | 3664519 | chr1 | 3658833 | 3664156 | dko_k27_peak_2 | 5323 | DKO |
chr1 | 3665636 | 3666032 | chr1 | 3665705 | 3666032 | wt_k27_peak_3 | 327 | WT |
chr1 | 3665636 | 3666032 | chr1 | 3665636 | 3665919 | dko_k27_peak_3 | 283 | DKO |
chr1 | 4468858 | 4469245 | chr1 | 4468858 | 4469245 | wt_k27_peak_4 | 387 | WT |
chr1 | 4468858 | 4469245 | . | -1 | -1 | . | 0 | DKO |
chr1 | 4472410 | 4473380 | . | -1 | -1 | . | 0 | WT |
chr1 | 4472410 | 4473380 | chr1 | 4472410 | 4473380 | dko_k27_peak_4 | 970 | DKO |
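I would like to remove any rows that have unique peak intersections. A unique intersection is a row that has the same UnionChr, UnionStart, and UnionEnd position as the next row, but where one of the rows has an Overlap value of 0. For example, rows 1 and 2, rows 8 and 9, and rows 10 and 11 are rows with unique peak intersections that I want to remove. I want to keep all other rows, in this case rows 3-7.
I'm having a hard time finding a way to eliminate the unique rows. I have tried subsetting with duplicated() and unique(), and for loops. I know this would be easier in python, but I don't know python and I only need to do this once.
Thanks! Below is the code to make the data frame: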
df <- data.frame(UnionChr=c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1"),
UnionStart = c(3657227, 3657227, 3658443, 3658443, 3658443, 3665636, 3665636, 4468858, 4468858, 4472410, 4472410),
UnionEnd = c(3658092, 3658092, 3664519, 3664519, 3664519, 3666032, 3666032, 4469245, 4469245, 4473380, 4473380),
IntersectChr = c("." , "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1", ".", ".", "chr1"),
IntersectStart = c(-1, 3657227, 3658443, 3663340, 3658833, 3665705, 3665636, 4468858, -1 , -1, 4472410),
IntersectEnd = c( -1, 3658092, 3662838, 3664519, 3664156, 3666032, 3665919, 4469245, -1, -1, 4473380),
IntersectName = c(".", "dko_k27_peak_1", "wt_k27_peak_1", "wt_k27_peak_2", "dko_k27_peak_2", "wt_k27_peak_3", "dko_k27_peak_3", "wt_k27_peak_4", ".", ".", "dko_k27_peak_4"),
Overlap = c(0, 865, 4395, 1179, 5323, 327, 283, 387, 0, 0, 970),
Genotype = c("WT", "DKO", "WT", "WT", "DKO", "WT", "DKO", "WT", "DKO", "WT", "DKO"))
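If you want to compare against both the row below and the row above (as your example suggests), I think you're looking for something like this. Your description is a bit vague, so please confirm that this does what you think it should.
Rather than building one very long filter condition, I've broken it up into a few steps. This is more readable, and you can look at the new columns to check each part of the filter condition and make sure it does what you expect.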
library(dplyr)

filtered_df <- df %>%
  group_by(UnionChr) %>%
  mutate(
    # does the row above or below share this row's union start / end?
    same_start = UnionStart == lead(UnionStart) | UnionStart == lag(UnionStart),
    same_end = UnionEnd == lead(UnionEnd) | UnionEnd == lag(UnionEnd),
    # is Overlap 0 on this row or on either neighbouring row?
    zero_overlap = (Overlap == 0 | lead(Overlap == 0) | lag(Overlap == 0)),
    # TRUE for rows to keep
    combined = !(same_start & same_end & zero_overlap)
  ) %>%
  filter(combined) %>%
  # drop the helper columns again
  select(-(same_start:combined))
Result:
# A tibble: 5 × 9
# Groups:   UnionChr [1]
  UnionChr UnionStart UnionEnd IntersectChr IntersectStart IntersectEnd IntersectName  Overlap
  <chr>         <dbl>    <dbl> <chr>                 <dbl>        <dbl> <chr>            <dbl>
1 chr1        3658443  3664519 chr1                3658443      3662838 wt_k27_peak_1     4395
2 chr1        3658443  3664519 chr1                3663340      3664519 wt_k27_peak_2     1179
3 chr1        3658443  3664519 chr1                3658833      3664156 dko_k27_peak_2    5323
4 chr1        3665636  3666032 chr1                3665705      3666032 wt_k27_peak_3      327
5 chr1        3665636  3666032 chr1                3665636      3665919 dko_k27_peak_3     283
# … with 1 more variable: Genotype <chr>
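To follow the suggestion of checking each helper column, one option is to keep the columns and look at them before filtering. This is only a sketch: `df_small`, `row_id`, and `checked` are names made up here (they are not in the original post), and `df_small` is a trimmed copy of the sample data with just the columns the check uses.

```r
library(dplyr)

# Trimmed copy of the sample data (assumption: only these columns matter here).
# row_id is a hypothetical helper column that records the original row number.
df_small <- data.frame(
  row_id     = 1:11,
  UnionChr   = rep("chr1", 11),
  UnionStart = c(3657227, 3657227, 3658443, 3658443, 3658443, 3665636,
                 3665636, 4468858, 4468858, 4472410, 4472410),
  UnionEnd   = c(3658092, 3658092, 3664519, 3664519, 3664519, 3666032,
                 3666032, 4469245, 4469245, 4473380, 4473380),
  Overlap    = c(0, 865, 4395, 1179, 5323, 327, 283, 387, 0, 0, 970)
)

# Same helper columns as in the answer, but without the filter() and select()
# steps, so every intermediate value stays visible.
checked <- df_small %>%
  group_by(UnionChr) %>%
  mutate(
    same_start   = UnionStart == lead(UnionStart) | UnionStart == lag(UnionStart),
    same_end     = UnionEnd == lead(UnionEnd) | UnionEnd == lag(UnionEnd),
    zero_overlap = (Overlap == 0 | lead(Overlap == 0) | lag(Overlap == 0)),
    combined     = !(same_start & same_end & zero_overlap)
  ) %>%
  ungroup()

# One line per original row, showing each part of the condition.
print(select(checked, row_id, same_start, same_end, zero_overlap, combined))
```

On the sample data, `combined` is TRUE only for rows 3-7, which matches the result above. `lead()` and `lag()` return NA at the last and first row of each group, so on the full data it is worth checking these columns for NA before trusting the filter.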
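Edit:
Based on your comment, I think this is actually easier. Find the groups of rows that share the same Union* variables, then check whether any Overlap in that group is 0. If so, throw out the whole group.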
df %>%
  group_by(UnionChr, UnionStart, UnionEnd) %>%
  filter(!any(Overlap == 0))
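As a quick check that the grouped filter keeps exactly the rows the question asks for (rows 3-7), here is a minimal sketch on a trimmed copy of the sample data. The names `df_small`, `row_id`, and `kept` are made up for illustration and are not from the original post; the `ungroup()` at the end is my addition so that the result is not left grouped.

```r
library(dplyr)

# Trimmed copy of the sample data (assumption: only these columns matter here).
# row_id is a hypothetical helper column that records the original row number.
df_small <- data.frame(
  row_id     = 1:11,
  UnionChr   = rep("chr1", 11),
  UnionStart = c(3657227, 3657227, 3658443, 3658443, 3658443, 3665636,
                 3665636, 4468858, 4468858, 4472410, 4472410),
  UnionEnd   = c(3658092, 3658092, 3664519, 3664519, 3664519, 3666032,
                 3666032, 4469245, 4469245, 4473380, 4473380),
  Overlap    = c(0, 865, 4395, 1179, 5323, 327, 283, 387, 0, 0, 970)
)

kept <- df_small %>%
  group_by(UnionChr, UnionStart, UnionEnd) %>%
  filter(!any(Overlap == 0)) %>%  # drop every union region that has a zero overlap
  ungroup()

kept$row_id
# 3 4 5 6 7
```

Because the test is made once per group of identical union coordinates, it does not depend on the matching rows being next to each other, unlike the lead()/lag() version.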