Importing/Conditioning a file.txt with a “kind” of json structure in R
Load a file.txt into df in R
I need help converting the contents of a file into a data frame.
Here is the file:
>OK0100087.1
0 375
376 750
751 1000
>OK0100088.1
0 87766
>OK0100089.1
0 66778
>OK0100090.1
0 47519
47520 73733
My idea is to turn the contents of this file into a data frame like this:
Name start end
OK0100087.1_0 0 375
OK0100087.1_1 376 750
OK0100087.1_2 751 1000
OK0100088.1 0 87766
OK0100089.1 0 66778
OK0100090.1_0 0 47519
OK0100090.1_1 47520 73733
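If a >OK...number header is followed by several lines, I append _Number to the name (a 0-based counter).
start is the first number on each line, and end is the last one.
Does anyone have an idea?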
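A base R solution:
txt <- readLines("foo.txt")
# Group id: increases by one at every ">" header line
grp <- cumsum(grepl("^>", txt))
Reduce(rbind, by(txt, grp, function(x){
  # x[1] is the header line, x[-1] holds its "start end" rows
  name <- sub("^>", "", x[1])
  # Add a 0-based suffix only when the header has more than one row
  cbind(Name = if(length(x) > 2) paste(name, seq_along(x[-1])-1, sep = "_") else name,
        read.table(text = x[-1], col.names = c("start", "end")))
}))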
# Name start end
# 1 OK0100087.1_0 0 375
# 2 OK0100087.1_1 376 750
# 3 OK0100087.1_2 751 1000
# 4 OK0100088.1 0 87766
# 5 OK0100089.1 0 66778
# 6 OK0100090.1_0 0 47519
# 7 OK0100090.1_1 47520 73733
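The grouping step in the answer above relies on `cumsum(grepl("^>", txt))`. As a minimal sketch of why that works, the snippet below inlines the sample lines from the question (instead of reading `foo.txt`) and shows the group id assigned to each line. The variable name `is_header` is only illustrative.

```r
# Sample lines from the question, inlined so the snippet runs on its own.
txt <- c(">OK0100087.1", "0 375", "376 750", "751 1000",
         ">OK0100088.1", "0 87766",
         ">OK0100089.1", "0 66778",
         ">OK0100090.1", "0 47519", "47520 73733")

is_header <- grepl("^>", txt)  # TRUE on ">OK..." lines, FALSE on data lines
grp <- cumsum(is_header)       # running count of headers seen so far

grp
# [1] 1 1 1 1 2 2 3 3 4 4 4

# Each group is one header line followed by its data lines.
first_group <- split(txt, grp)[[1]]
```

Because every header bumps the running count by one, all lines up to the next header share the same id, which is what lets `by()` or `split()` process one record at a time.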
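Alternatively, using data.table:
library(data.table)
path <- "file.txt"
# The file contains no commas, so sep = "," reads each line into the single column V1
OUT <- fread(path, sep = ",", header = FALSE)
OUT[,
  # First line of each group is the name; the remaining lines split into Start and End
  setNames(c(V1[1L], tstrsplit(V1[-1L], " ")), c("Name", "Start", "End")),
  by = cumsum(grepl("^>", V1))
][, Name := sub(">", "", Name)
][,
  Name := if (.N>1L) sprintf("%s_%d", Name, 1L:.N - 1L) else Name,
  by = Name
][, !"cumsum"]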
# Name Start End
# 1: OK0100087.1_0 0 375
# 2: OK0100087.1_1 376 750
# 3: OK0100087.1_2 751 1000
# 4: OK0100088.1 0 87766
# 5: OK0100089.1 0 66778
# 6: OK0100090.1_0 0 47519
# 7: OK0100090.1_1 47520 73733
A data.table solution
I kept all the intermediate steps, so you can check what actually happens along the way.
Sample data
library( data.table )
DT <- fread(
text = ">OK0100087.1
0 375
376 750
751 1000
>OK0100088.1
0 87766
>OK0100089.1
0 66778
>OK0100090.1
0 47519
47520 73733", sep = "", header = FALSE)
# V1
# 1: >OK0100087.1
# 2: 0 375
# 3: 376 750
# 4: 751 1000
# 5: >OK0100088.1
# 6: 0 87766
# 7: >OK0100089.1
# 8: 0 66778
# 9: >OK0100090.1
# 10: 0 47519
# 11: 47520 73733
Code
# split into a list, one element per ">OK" header
L <- split( DT, cumsum( grepl( "^>OK", DT$V1 ) ) )
# use the first row, minus the leading ">", as the name
names(L) <- sapply( L, function(x) sub( "^>", "", x$V1[1] ) )
# drop the header row from each element, split values into columns
L2 <- lapply( L, function(x) {
  tmp <- x[-1]
  tmp[, c("start", "end") := tstrsplit( V1, " ") ][, V1 := NULL]
})
# bind together
ans <- rbindlist( L2, use.names = TRUE, fill = TRUE, idcol = "Name" )
# add a counter to names that occur more than once
ans[ ans[, if(.N>1) .I , by=.(Name)]$V1,
     Name := paste0( Name, "_", seq_len(.N) - 1 ), by = .(Name) ][]
# Name start end
# 1: OK0100087.1_0 0 375
# 2: OK0100087.1_1 376 750
# 3: OK0100087.1_2 751 1000
# 4: OK0100088.1 0 87766
# 5: OK0100089.1 0 66778
# 6: OK0100090.1_0 0 47519
# 7: OK0100090.1_1 47520 73733
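For comparison, the same header/row logic can also be written without a per-group loop. The following is only a sketch in base R, not taken from any of the answers: `parse_ranges()` is a hypothetical helper name, and the sample lines are inlined rather than read from a file.

```r
# parse_ranges() is a hypothetical helper, not part of any package.
parse_ranges <- function(lines) {
  is_header <- grepl("^>", lines)
  # Group id of every data line (headers themselves are dropped)
  g <- cumsum(is_header)[!is_header]
  # Header text without ">", repeated once per data line of its group
  name <- sub("^>", "", lines[is_header])[g]
  # Parse all "start end" lines in one call
  rng <- read.table(text = lines[!is_header], col.names = c("start", "end"))
  # 0-based position within the group, and the size of the group
  idx <- ave(seq_along(g), g, FUN = seq_along) - 1L
  n   <- ave(seq_along(g), g, FUN = length)
  # Suffix the name only when the header has more than one row
  data.frame(Name = ifelse(n > 1, paste(name, idx, sep = "_"), name),
             rng, stringsAsFactors = FALSE)
}

txt <- c(">OK0100087.1", "0 375", "376 750", "751 1000",
         ">OK0100088.1", "0 87766",
         ">OK0100089.1", "0 66778",
         ">OK0100090.1", "0 47519", "47520 73733")
res <- parse_ranges(txt)
```

Grouping by the running header count rather than by the name itself means two headers with the same text would still be treated as separate records. The sketch assumes every header is followed by at least one data line.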
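Disclaimer: The technical posts on this site are published under the CC BY-SA 4.0 license. If you republish them, please cite this site's URL or the original address. For any questions, contact: yoyou2525@163.com.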