
Load a file.txt into df in R

I need help converting the contents of a file into a dataframe.

Here is the file:

>OK0100087.1
0 375
376 750
751 1000
>OK0100088.1
0 87766
>OK0100089.1
0 66778
>OK0100090.1
0 47519
47520 73733

I would like to turn the file's contents into a df like this:

Name           start end
OK0100087.1_0  0      375
OK0100087.1_1  376    750
OK0100087.1_2  751    1000
OK0100088.1    0      87766
OK0100089.1    0      66778
OK0100090.1_0  0      47519
OK0100090.1_1  47520  73733

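If a >OK...number header is followed by several lines, I append a _Number suffix to the name, counting from 0. A header followed by a single line keeps its name unchanged.

start is the first number on each line, and end is the last.

Does anyone have an idea?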

A base R solution:

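txt <- readLines("foo.txt")            # one element per line of the file
grp <- cumsum(grepl("^>", txt))        # each ">" header line starts a new group
Reduce(rbind, by(txt, grp, function(x){
  name <- sub("^>", "", x[1])          # header without the leading ">"
  # x holds the header plus its data lines, so length(x) > 2 means several data lines
  cbind(Name = if(length(x) > 2) paste(name, seq_along(x[-1])-1, sep = "_") else name,
        read.table(text = x[-1], col.names = c("start", "end")))
}))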

#            Name start   end
# 1 OK0100087.1_0     0   375
# 2 OK0100087.1_1   376   750
# 3 OK0100087.1_2   751  1000
# 4   OK0100088.1     0 87766
# 5   OK0100089.1     0 66778
# 6 OK0100090.1_0     0 47519
# 7 OK0100090.1_1 47520 73733
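
The grouping in this answer rests on `cumsum(grepl("^>", txt))`: every header line bumps a running counter, so a header and the data lines below it share one group id. Here is a minimal sketch of that idiom on the first six lines of the sample file. The `txt` vector is typed in by hand rather than read from disk, and `is_header` and `n_data` are names introduced only for this illustration.

```r
# First six lines of the sample file, typed in by hand for illustration
txt <- c(">OK0100087.1", "0 375", "376 750", "751 1000",
         ">OK0100088.1", "0 87766")

# TRUE on header lines, FALSE on data lines
is_header <- grepl("^>", txt)

# Running count of headers seen so far = group id of each line
grp <- cumsum(is_header)
grp
# [1] 1 1 1 1 2 2

# Data lines per header: group size minus the header line itself
n_data <- as.vector(table(grp)) - 1L
n_data
# [1] 3 1
```

A group with more than one data line is exactly the case where the `_Number` suffix is needed, which is what the `length(x) > 2` test above checks.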

Alternatively, using data.table:

library(data.table)

path <- "file.txt"

OUT <- fread(path, sep = ",", header = FALSE)
OUT[, 
    setNames(c(V1[1L], tstrsplit(V1[-1L], " ")), c("Name", "Start", "End")), 
    by = cumsum(grepl("^>", V1))
    ][, Name := sub(">", "", Name)
      ][, 
        Name := if (.N>1L) sprintf("%s_%d", Name, 1L:.N - 1L) else Name, 
        by = Name
        ][, !"cumsum"]

#             Name Start   End
# 1: OK0100087.1_0     0   375
# 2: OK0100087.1_1   376   750
# 3: OK0100087.1_2   751  1000
# 4:   OK0100088.1     0 87766
# 5:   OK0100089.1     0 66778
# 6: OK0100090.1_0     0 47519
# 7: OK0100090.1_1 47520 73733
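
The suffixing rule (add `_0`, `_1`, ... only when a name covers more than one row) can also be tried on its own. The following is a rough base R sketch, not part of either answer. It assumes the header name has already been copied down to each data row, and that the same name never appears under two separate headers. The vectors `name`, `pos`, `n`, and `labelled` are hypothetical names used only here.

```r
# Hypothetical per-row names after the header has been copied down
name <- c("OK0100087.1", "OK0100087.1", "OK0100087.1",
          "OK0100088.1", "OK0100090.1", "OK0100090.1")

# Zero-based position of each row within its name group
pos <- ave(seq_along(name), name, FUN = seq_along) - 1L

# Number of rows in each name group, repeated for every row
n <- ave(seq_along(name), name, FUN = length)

# Add the suffix only when a name covers more than one row
labelled <- ifelse(n > 1L, paste(name, pos, sep = "_"), name)
labelled
```

Grouping by name keeps the sketch short. If the same name could occur under more than one header, grouping by header position (as the `cumsum(grepl(...))` id does) would be the safer choice.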

A data.table solution

I have kept all the intermediate steps, so you can inspect what actually happens along the way.

Sample data

library( data.table )

DT <- fread( 
text = ">OK0100087.1
0 375
376 750
751 1000
>OK0100088.1
0 87766
>OK0100089.1
0 66778
>OK0100090.1
0 47519
47520 73733", sep = "", header = FALSE)

#               V1
#  1: >OK0100087.1
#  2:        0 375
#  3:      376 750
#  4:     751 1000
#  5: >OK0100088.1
#  6:      0 87766
#  7: >OK0100089.1
#  8:      0 66778
#  9: >OK0100090.1
# 10:      0 47519
# 11:  47520 73733

Code

#split to list
L <- split( DT, cumsum( grepl( "^>OK", DT$V1 ) ) )
#use first row as name, dropping the leading ">" so it matches the requested output
names(L) <- sapply( L, function(x) sub( "^>", "", x$V1[1] ) )
#drop first element from list, split values to column (type.convert makes them numeric)
L2 <- lapply( L, function(x) { 
  tmp <- x[-1]
  tmp[, c("start", "end") := tstrsplit( V1, " ", type.convert = TRUE ) ][, V1 := NULL]
})
#bind together
ans <- rbindlist( L2, use.names = TRUE, fill = TRUE, idcol = "Name" )
#add_counter
ans[ ans[, if(.N>1) .I , by=.(Name)]$V1,
     Name := paste0( Name, "_", seq_len(.N) - 1 ), by = .(Name) ][]

#             Name start   end
# 1: OK0100087.1_0     0   375
# 2: OK0100087.1_1   376   750
# 3: OK0100087.1_2   751  1000
# 4:   OK0100088.1     0 87766
# 5:   OK0100089.1     0 66778
# 6: OK0100090.1_0     0 47519
# 7: OK0100090.1_1 47520 73733

