
Pulling disaggregated blocks of data into R

I have a csv file that is unusually laid out. The data is not a contiguous block at the top. The csv file can be characterized as such:

Comment Strings
Empty row
Comment String

[Desired Data with 10 columns and an undetermined number of rows]

Empty Row

Comment String 

[Desired Data with 10 columns and an undetermined number of rows]

Empty Row

Comment String 

[Desired Data with 10 columns and an undetermined number of rows]

.... and so on and so forth.

As stated, each block of data has a random number of rows.

What would be the best way to pull this data into R? The read.table/read.csv functions can only do so much.

 read.table("C:\\Users\\Riemmman\\Desktop\\Historical Data\\datafile.csv",header=F,sep=",",skip=15,blank.lines.skip=T)

I just recently faced a problem like this. My solution was to use awk to separate out the different types of rows, load them into different tables in a DBMS, and use SQL to create a flat file for loading into R.

Or maybe you can awk out only your desired data and load that, if you don't care about the comment strings.
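A minimal sketch of that second idea, assuming a Unix-like system where awk is available and that the data rows are the only lines with exactly 10 comma-separated fields; the file names here are hypothetical:

## Keep only lines with exactly 10 comma-separated fields,
## then read the filtered file back into R.
system("awk -F',' 'NF == 10' datafile.csv > data_only.csv")
dat <- read.csv("data_only.csv", header = FALSE)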

You might be able to use a combination of readLines and grep / grepl to help you figure out which lines to read.

Here's an example. The first part is just to make up some sample data.

Create some sample data.

x <- tempfile(pattern="myFile", fileext=".csv")

cat("junk comment strings",
    "",
    "another junk comment string",
    "This,Is,My,Data",
    "1,2,3,4",
    "5,6,7,8",
    "",
    "back to comments",
    "This,Is,My,Data",
    "12,13,14,15",
    "15,16,17,18",
    "19,20,21,22", file = x, sep = "\n")

Step 1: Use readLines() to get the data into R

In this step, we'll also drop the lines that we are not interested in. The logic is that we are only interested in lines where there is information in the form of (for a four-column dataset):

something comma something comma something comma something


## Read the data into R
## Replace "con" with the actual path to your file
A <- readLines(con = x)

## Find and extract the lines where there are "data".
## My example dataset only has 4 columns.
## Modify for your actual dataset.
A <- A[grepl(paste(rep(".*", 4), collapse=","), A)]

Step 2: Identify the data ranges

## Identify the header rows. -1 for use with read.csv
HeaderRows <- grep("^This,Is", A)-1

## Identify the number of rows per data group
N <- c(diff(HeaderRows)-1, length(A)-1)

Step 3: Read the data in

Use the data range information to specify how many lines to skip before reading, and how many lines to read.

myData <- lapply(seq_along(HeaderRows), 
       function(x) read.csv(text = A, header = TRUE, 
                            nrows = N[x], skip = HeaderRows[x]))
myData
# [[1]]
#   This Is My Data
# 1    1  2  3    4
# 2    5  6  7    8
# 
# [[2]]
#   This Is My Data
# 1   12 13 14   15
# 2   15 16 17   18
# 3   19 20 21   22

If you want all of these in one data.frame instead of a list, use:

final <- do.call(rbind, myData)
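If you also want to keep track of which block each row came from, a small variation (the block column name here is just an illustration) would be:

## Add a block id to each data.frame before combining
final <- do.call(rbind, Map(function(d, i) transform(d, block = i),
                            myData, seq_along(myData)))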

Using the data generated by @Ananda Mahto,

file = x # change for the actual file name
alldata = readLines(file) # read all data
# count the fields in data (separated by comma)
nfields = count.fields(file=textConnection(alldata), sep=",", blank.lines.skip=FALSE) 
# assume the data rows have the 'mode' of the field counts (change for the actual number of columns if known)
dataFields = as.numeric(names(table(nfields))[which.max(table(nfields))]) 

alldata = alldata[nfields == dataFields] # read data lines only
header = alldata[1] # the header
alldata = c(header, alldata[alldata!=header]) # remove the extra headers
datos = read.csv(text=alldata) # read the data

  This Is My Data
1    1  2  3    4
2    5  6  7    8
3   12 13 14   15
4   15 16 17   18
5   19 20 21   22
