R: Reading a dataset where each observation is split into 2 lines?
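I am trying to read a space-delimited file in which every observation is broken in two by a line break. Is there a way to make read.table or fread keep scanning values until a whole row is filled?
The header and the first two rows of the dataset look like this: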
tsales sales margin nown nfull npart naux hoursw hourspw inv1 inv2 ssize start
750000 4411.765 41 1 1 1 1.5357
76 16.75596 17166.67 27177.04 170 41
1926395 4280.878 39 2 2 3 1.5357
192 22.49376 17166.67 27177.04 450 39
Since each row of the final data is split across exactly 2 lines in the input, you can try this:
#read file
txt <- readLines("test.txt")
#extract header and remove it from data
df_header <- strsplit(txt[1], split=" ")[[1]]
txt <- txt[-1]
#merge every 2 subsequent lines into one to form a row of the final data frame
idx <- seq(1, length(txt), by=2)
txt[idx] <- paste(txt[idx], txt[idx+1])
txt <- txt[-(idx+1)]
#final data
df <- read.table(text=txt, col.names=df_header)
The output is:
tsales sales margin nown nfull npart naux hoursw hourspw inv1 inv2 ssize start
1 750000 4411.765 41 1 1 1 1.5357 76 16.75596 17166.67 27177.04 170 41
2 1926395 4280.878 39 2 2 3 1.5357 192 22.49376 17166.67 27177.04 450 39
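If a file splits each observation over more than two lines, the same pasting idea can be generalized. The following is only a sketch and not part of the original answer: `merge_lines` and its argument `k` are hypothetical names, and it assumes every observation spans exactly `k` lines.

```r
# Hypothetical helper: glue every k consecutive lines into one record.
# Assumes length(txt) is a multiple of k.
merge_lines <- function(txt, k = 2) {
  stopifnot(length(txt) %% k == 0)
  group <- (seq_along(txt) - 1) %/% k   # 0, 0, 1, 1, ... when k = 2
  unname(vapply(split(txt, group), paste, character(1), collapse = " "))
}

# The sample data from the question, held in memory instead of test.txt
txt <- c(
  "tsales sales margin nown nfull npart naux hoursw hourspw inv1 inv2 ssize start",
  "750000 4411.765 41 1 1 1 1.5357",
  "76 16.75596 17166.67 27177.04 170 41",
  "1926395 4280.878 39 2 2 3 1.5357",
  "192 22.49376 17166.67 27177.04 450 39"
)

df_header <- strsplit(txt[1], split = " ")[[1]]
merged <- merge_lines(txt[-1], k = 2)
df_k <- read.table(text = merged, col.names = df_header)
```

The `stopifnot` guard makes the function fail loudly when the file has a dangling line, rather than silently pasting an `NA` onto the last record.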
Sample data: test.txt
contains
tsales sales margin nown nfull npart naux hoursw hourspw inv1 inv2 ssize start
750000 4411.765 41 1 1 1 1.5357
76 16.75596 17166.67 27177.04 170 41
1926395 4280.878 39 2 2 3 1.5357
192 22.49376 17166.67 27177.04 450 39
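On the question of scanning values until a row is full: base R's `scan()` reads whitespace-separated values across line breaks, so it can be applied to the file directly. This is a sketch under two assumptions: all data fields are numeric, and every observation has exactly as many values as the header has names. The temporary file only stands in for test.txt.

```r
# Write the sample data to a temporary file (stand-in for test.txt)
path <- tempfile(fileext = ".txt")
writeLines(c(
  "tsales sales margin nown nfull npart naux hoursw hourspw inv1 inv2 ssize start",
  "750000 4411.765 41 1 1 1 1.5357",
  "76 16.75596 17166.67 27177.04 170 41",
  "1926395 4280.878 39 2 2 3 1.5357",
  "192 22.49376 17166.67 27177.04 450 39"
), path)

# First line: column names; everything after it: one long numeric vector
header <- scan(path, what = character(), nlines = 1, quiet = TRUE)
values <- scan(path, skip = 1, quiet = TRUE)

# Guard against a truncated file before reshaping
stopifnot(length(values) %% length(header) == 0)

# Fill a matrix row by row, one row per observation
df_scan <- as.data.frame(
  matrix(values, ncol = length(header), byrow = TRUE,
         dimnames = list(NULL, header))
)
```

With this approach every column comes back as double, because the values pass through a single numeric vector; convert individual columns afterwards if integer types matter.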
When I read your sample data as it is, it looks like this...
tsales sales margin nown nfull npart naux hoursw hourspw inv1 inv2 ssize start
1 750000 4411.76500 41.00 1.00 1 1 1.5357 NA NA NA NA NA NA
2 76 16.75596 17166.67 27177.04 170 41 NA NA NA NA NA NA NA
3 1926395 4280.87800 39.00 2.00 2 3 1.5357 NA NA NA NA NA NA
4 192 22.49376 17166.67 27177.04 450 39 NA NA NA NA NA NA NA
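Because the two kinds of rows alternate and only a few columns are involved, this is easy to code by hand:
#assumes mydata.csv holds the NA-padded table shown above
Data=read.csv("mydata.csv")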
firstData=Data[!is.na(Data$naux),]
secondData=Data[is.na(Data$naux),]
firstData$hoursw=secondData$tsales
firstData$hourspw=secondData$sales
firstData$inv1=secondData$margin
firstData$inv2=secondData$nown
firstData$ssize=secondData$nfull
firstData$start=secondData$npart
Data=firstData
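The data is split in two: odd rows and even rows. The missing columns of the odd rows are then filled with the correct values, which sit in the leading columns of the even rows. Hope this helps!
The final output is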
> firstData
tsales sales margin nown nfull npart naux hoursw hourspw inv1 inv2 ssize start
1 750000 4411.765 41 1 1 1 1.5357 76 16.75596 17166.67 27177.04 170 41
3 1926395 4280.878 39 2 2 3 1.5357 192 22.49376 17166.67 27177.04 450 39
> secondData
tsales sales margin nown nfull npart naux hoursw hourspw inv1 inv2 ssize start
2 76 16.75596 17166.67 27177.04 170 41 NA NA NA NA NA NA NA
4 192 22.49376 17166.67 27177.04 450 39 NA NA NA NA NA NA NA
> Data
tsales sales margin nown nfull npart naux hoursw hourspw inv1 inv2 ssize start
1 750000 4411.765 41 1 1 1 1.5357 76 16.75596 17166.67 27177.04 170 41
3 1926395 4280.878 39 2 2 3 1.5357 192 22.49376 17166.67 27177.04 450 39
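The column-by-column assignments can also be written without naming each column. This is a sketch rather than the answerer's code: it assumes the file is read with `read.table(..., fill = TRUE)` so that short lines are padded with `NA`, that rows strictly alternate, and that the first odd row has no genuinely missing values. The names `odd`, `even`, `k`, and `p` are illustrative.

```r
txt <- c(
  "tsales sales margin nown nfull npart naux hoursw hourspw inv1 inv2 ssize start",
  "750000 4411.765 41 1 1 1 1.5357",
  "76 16.75596 17166.67 27177.04 170 41",
  "1926395 4280.878 39 2 2 3 1.5357",
  "192 22.49376 17166.67 27177.04 450 39"
)

# fill = TRUE pads short lines with NA, giving the 4-row table shown earlier
Data <- read.table(text = txt, header = TRUE, fill = TRUE)

odd  <- Data[c(TRUE, FALSE), ]   # rows 1, 3, ...: first half of each observation
even <- Data[c(FALSE, TRUE), ]   # rows 2, 4, ...: second half

p <- ncol(Data)                  # total number of columns
k <- sum(!is.na(odd[1, ]))       # columns already filled in an odd row

# Move the leading p - k columns of the even rows into the empty tail
odd[(k + 1):p] <- even[seq_len(p - k)]
rownames(odd) <- NULL
```

Deriving `k` from the data keeps the code working if the split point changes, at the cost of relying on the first row being complete.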
Here is a data.table solution (I copied your example into the file dfTest.txt). See the comments for an explanation:
library(data.table)
#fill=TRUE fills empty cols due to irregular structure with NAs
dt=fread("dfTest.txt",header = TRUE,sep=" ",fill=TRUE)
#cols to fix
selCols=c("hoursw","hourspw","inv1","inv2","ssize","start")
#cols from which to read
otherCols=colnames(dt)[seq_along(selCols)]
#fill missing cols from leading rows and select every 2nd row afterwards
dt[,c(selCols):=shift(.SD,n=1L,type="lead"),
.SDcols=otherCols][seq(1,nrow(dt),2),]
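To see why the shift works: `shift(..., type = "lead")` moves each column up by one row, so every odd row receives the values stored in the leading columns of the even row below it. A base-R sketch of that single step on one column follows; `lead1` is a hypothetical helper written for illustration, not a data.table function.

```r
# Hypothetical base-R stand-in for a lead shift by one position
lead1 <- function(x) c(x[-1], NA)

# The tsales column as read with fill = TRUE: real values alternate with
# values that actually belong to hoursw
tsales <- c(750000, 76, 1926395, 192)

# After the shift, each odd position holds the value from the row below
hoursw <- lead1(tsales)

# Keeping every 2nd row, starting at the first, leaves the repaired column
hoursw_fixed <- hoursw[seq(1, length(tsales), by = 2)]
```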