
r: Reading a dataset where each observation is split into 2 lines?

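I am trying to read a space-delimited file in which each observation is broken across two lines by a newline. Is there a way to make read.table or fread keep scanning values until a full row has been filled?

The header and the first two rows of the dataset look like this: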

   tsales sales margin nown nfull npart naux hoursw hourspw inv1 inv2 ssize start
       750000   4411.765         41          1          1          1     1.5357
           76   16.75596   17166.67   27177.04        170         41
      1926395   4280.878         39          2          2          3     1.5357
          192   22.49376   17166.67   27177.04        450         39

Since each row of the final data is split across exactly 2 lines in the input, you could try this -

# read the file as raw lines
txt <- readLines("test.txt")

# extract the header (split on runs of whitespace) and remove it from the data
df_header <- strsplit(trimws(txt[1]), split = "\\s+")[[1]]
txt <- txt[-1]

# merge every 2 subsequent lines into one to form a row of the final data frame
idx <- seq(1, length(txt), by = 2)
txt <- paste(txt[idx], txt[idx + 1])

# final data
df <- read.table(text = txt, col.names = df_header)

The output is:

   tsales    sales margin nown nfull npart   naux hoursw  hourspw     inv1     inv2 ssize start
1  750000 4411.765     41    1     1     1 1.5357     76 16.75596 17166.67 27177.04   170    41
2 1926395 4280.878     39    2     2     3 1.5357    192 22.49376 17166.67 27177.04   450    39

Sample data: test.txt contains

tsales sales margin nown nfull npart naux hoursw hourspw inv1 inv2 ssize start
750000   4411.765         41          1          1          1     1.5357
76   16.75596   17166.67   27177.04        170         41
1926395   4280.878         39          2          2          3     1.5357
192   22.49376   17166.67   27177.04        450         39
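
As a possible alternative to pasting lines together, base R's scan() reads whitespace-separated values without regard to where the line breaks fall, which is close to what the question asks for. The sketch below is an assumption-laden illustration, not part of the answer above: it holds the sample data in a character vector (`lines`, a name introduced here) instead of reading test.txt, and it only works when every column is numeric.

```r
# Sample input from the question, kept in memory so the sketch is self-contained
lines <- c(
  "tsales sales margin nown nfull npart naux hoursw hourspw inv1 inv2 ssize start",
  "750000   4411.765         41          1          1          1     1.5357",
  "76   16.75596   17166.67   27177.04        170         41",
  "1926395   4280.878         39          2          2          3     1.5357",
  "192   22.49376   17166.67   27177.04        450         39"
)

# Column names come from the first line
col_names <- scan(text = lines[1], what = "", quiet = TRUE)

# scan() returns one long numeric vector; line breaks are treated like spaces
values <- scan(text = lines[-1], quiet = TRUE)

# Reshape row by row: one row per observation, one column per header name
df_scan <- as.data.frame(
  matrix(values, ncol = length(col_names), byrow = TRUE,
         dimnames = list(NULL, col_names))
)
```

With a real file, the same idea would use scan("test.txt", skip = 1) for the values. Because the values pass through a numeric matrix, a file with character columns would need the line-pasting approach instead.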

I read in your sample data, and it looks like this...

   tsales      sales   margin     nown nfull npart   naux hoursw hourspw inv1 inv2 ssize start
1  750000 4411.76500    41.00     1.00     1     1 1.5357     NA      NA   NA   NA    NA    NA
2      76   16.75596 17166.67 27177.04   170    41     NA     NA      NA   NA   NA    NA    NA
3 1926395 4280.87800    39.00     2.00     2     3 1.5357     NA      NA   NA   NA    NA    NA
4     192   22.49376 17166.67 27177.04   450    39     NA     NA      NA   NA   NA    NA    NA

Because the two kinds of rows alternate and there are only a few columns, we can code this easily

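# assumes mydata.csv holds the sample data with short rows padded by NA;
# for the space-delimited file, read.table("test.txt", header = TRUE, fill = TRUE)
# should give the same layout
Data <- read.csv("mydata.csv")
# odd rows carry the first 7 columns; even rows have NA in naux
firstData <- Data[!is.na(Data$naux), ]
secondData <- Data[is.na(Data$naux), ]
# copy the even-row values into the columns they really belong to
firstData$hoursw <- secondData$tsales
firstData$hourspw <- secondData$sales
firstData$inv1 <- secondData$margin
firstData$inv2 <- secondData$nown
firstData$ssize <- secondData$nfull
firstData$start <- secondData$npart
Data <- firstData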

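The data is split into two parts: the odd rows and the even rows. The NA columns of the odd rows are then filled with the correct values taken from the even rows. Hope this helps!

The final output is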

> firstData
   tsales    sales margin nown nfull npart   naux hoursw  hourspw     inv1     inv2 ssize start
1  750000 4411.765     41    1     1     1 1.5357     76 16.75596 17166.67 27177.04   170    41
3 1926395 4280.878     39    2     2     3 1.5357    192 22.49376 17166.67 27177.04   450    39

> secondData
  tsales    sales   margin     nown nfull npart naux hoursw hourspw inv1 inv2 ssize start
2     76 16.75596 17166.67 27177.04   170    41   NA     NA      NA   NA   NA    NA    NA
4    192 22.49376 17166.67 27177.04   450    39   NA     NA      NA   NA   NA    NA    NA

> Data
   tsales    sales margin nown nfull npart   naux hoursw  hourspw     inv1     inv2 ssize start
1  750000 4411.765     41    1     1     1 1.5357     76 16.75596 17166.67 27177.04   170    41
3 1926395 4280.878     39    2     2     3 1.5357    192 22.49376 17166.67 27177.04   450    39
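
The odd/even split described above can also be written without naming each column. This is only a sketch under stated assumptions: the sample data is held in a character vector (`lines`, introduced here for illustration), it is read with read.table(fill = TRUE) rather than from mydata.csv, and the 7 + 6 column split is hard-coded to match the sample.

```r
# Sample input from the question, kept in memory so the sketch is self-contained
lines <- c(
  "tsales sales margin nown nfull npart naux hoursw hourspw inv1 inv2 ssize start",
  "750000   4411.765         41          1          1          1     1.5357",
  "76   16.75596   17166.67   27177.04        170         41",
  "1926395   4280.878         39          2          2          3     1.5357",
  "192   22.49376   17166.67   27177.04        450         39"
)

# fill = TRUE pads the short lines with NA, giving 4 rows of 13 columns
raw <- read.table(text = lines, header = TRUE, fill = TRUE)

# Odd rows hold columns 1-7; even rows hold the values for columns 8-13
odd  <- raw[seq(1, nrow(raw), by = 2), 1:7]
even <- raw[seq(2, nrow(raw), by = 2), 1:6]

# Give the even-row values their real column names and drop the old row names
names(even) <- names(raw)[8:13]
rownames(odd) <- NULL
rownames(even) <- NULL

# Side-by-side bind: one complete observation per row
combined <- cbind(odd, even)
```

The design choice here is positional renaming instead of six separate assignments, which keeps the code short when the second line holds many columns.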

Here is a data.table solution (I copied your example into the file dfTest.txt). See the comments for an explanation:

library(data.table)
#fill=TRUE fills empty cols due to irregular structure with NAs
dt=fread("dfTest.txt",header = TRUE,sep=" ",fill=TRUE)
#cols to fix
selCols=c("hoursw","hourspw","inv1","inv2","ssize","start")
#cols from which to read
otherCols=colnames(dt)[seq_along(selCols)]
#fill missing cols from leading rows and select every 2nd row afterwards
dt[,c(selCols):=shift(.SD,n=1L,type="lead"),
    .SDcols=otherCols][seq(1,nrow(dt),2),]
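
To make the shift step in the data.table code easier to follow, here is a small sketch of what type = "lead" does to a single vector. The vector `x` is made up for illustration from the tsales column of the NA-padded sample; it is not part of the answer above.

```r
library(data.table)

# tsales as fread(fill = TRUE) sees it: real values on odd rows,
# hoursw values misplaced on even rows
x <- c(750000, 76, 1926395, 192)

# "lead" moves every value up by one position and pads the end with NA,
# so each odd position now holds the value from the line below it
lead_x <- shift(x, n = 1L, type = "lead")

# keeping only the odd positions recovers the hoursw column
hoursw <- lead_x[seq(1, length(x), by = 2)]
```

Applied to the first six columns at once via .SDcols, this is the same operation the answer uses before selecting every second row.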
