I am creating my own R package and I was wondering what are the possible methods that I can use to add (time-series) datasets to my package. Here are the specifics:
I have created a package subdirectory called data and I am aware that this is the location where I should save the datasets that I want to add to my package. I am also aware that the files containing the data may be .rda, .txt, or .csv files.
Each series of data that I want to add to the package consists of a single column of numbers (e.g. of the form 340 or 4.5) and each series of data differs in length.
So far, I have saved all of the datasets into a .txt file. I have also successfully loaded the data using the data() function. The problem is not solved, however.
The problem is that each series of data loads as a factor except for the series greatest in length. The series that load as factors contain missing values (of the form '.'). I had to add these missing values in order to make each column of data the same in length. I tried saving the data as unequal columns, but I received an error message after calling data() .
A consequence of adding missing values to get the data to load is that once the data is loaded, I need to remove the NA's in order to get on with my analysis of the data. So this clearly is not a good way of doing things.
Ideally (I suppose), I would like the data to load as numeric vectors or as a list. In this way, I wouldn't need the NA's appended to the end of each series.
How do I solve this problem? Should I save all of the data into one single file? If so, in what format should I do it? Perhaps I should save the datasets into a number of files? Again, in which format? What is the best practical way of doing this? Any tips would greatly be appreciated.
I'm not sure I understood your question correctly, but if you get your data into the form you want in R and save it with
save(myediteddata, file="data.rda")
the data should load exactly the way you saw it in R.
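As a minimal sketch of that idea, a list of numeric vectors of unequal length can be saved and restored without any padding. The object name my_series, its values, and the temporary file path are made up for illustration; in a package the file would live under data/.

```r
# Hypothetical series of unequal length, stored as a named list
my_series <- list(a = c(340, 4.5, 12), b = c(1.2, 7))

# Save to a temporary .rda file (in a package this would be data/my_series.rda)
rda_path <- file.path(tempdir(), "my_series.rda")
save(my_series, file = rda_path)

# Restore into a fresh environment; load() returns the names it restored
e <- new.env()
loaded_names <- load(rda_path, envir = e)
str(e$my_series)
```

Because the list is stored as an R object, each series comes back as a numeric vector of its original length.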
To make all the datasets in the data directory available, you should add
LazyData: true
to the DESCRIPTION file of your package.
If this doesn't help, you could post one of your files and a printout of the format you want; this will help us to help you ;)
In addition to saving as rda files you could also choose to load them as numeric with:
read.table( ... , colClasses="numeric")
Or as non-factor-text:
read.table( ..., as.is=TRUE) # which does pretty much the same as stringsAsFactors=FALSE
read.table( ..., colClasses="character")
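Note that the data function is documented to read .txt files through read.table(..., header=TRUE), but as far as I can tell the arguments you pass to data() itself are dataset names rather than options handed on to read.table, so you would need to call read.table yourself to use these arguments.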
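For the '.' padding described in the question, a sketch along these lines may help. It assumes the padded file looks like the inline text below (the column names s1 and s2 and the values are invented), tells read.table to treat "." as NA so the columns stay numeric, and then drops the padding to get a list of vectors.

```r
# Hypothetical contents of the padded .txt file; "." marks the padding
raw_text <- "s1 s2\n340 4.5\n12 .\n7.5 ."

# na.strings = "." turns the padding into NA, so no column becomes a factor
padded <- read.table(text = raw_text, header = TRUE,
                     na.strings = ".", colClasses = "numeric")

# Drop the NA padding, giving one numeric vector per series
series <- lapply(padded, function(x) x[!is.na(x)])
str(series)
```

The resulting list could then be stored with save() so the padding never has to be handled again at load time.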
The preferred location for your data depends on its format.
As Hadley suggested:
- If you want to store binary data and make it available to the user, put it in data/. This is the best place to put example datasets.
- If you want to store parsed data, but not make it available to the user, put it in R/sysdata.rda. This is the best place to put data that your functions need.
- If you want to store raw data, put it in inst/extdata.
I suggest you have a look at the linked chapter as it goes into detail about working with data when developing R packages.
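For raw files under inst/extdata, the usual way to locate them at run time is system.file(). The sketch below uses a hypothetical package name (mypackage) and file name (series.txt); system.file() returns an empty string when the file cannot be found.

```r
# "mypackage" and "series.txt" are hypothetical names used for illustration
path <- system.file("extdata", "series.txt", package = "mypackage")

if (nzchar(path)) {
  # A single column of numbers can be read straight into a numeric vector
  series <- scan(path, quiet = TRUE)
} else {
  message("File not found; is the package installed?")
}
```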
- Create a directory /data and place any data in it. Use only .rda and .RData files (as recommended).
- Give the data object the same name as the .rda file, as recommended: save(oceans, file = "data/oceans.rda", version = 2)
- Add the following to DESCRIPTION:
LazyData: true
- Load the data. It is restored under the name it had when you called save() (in this case 'oceans'), so there is no need to assign it with <-:
load("data/oceans.rda")
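oceans <- function() {
  # Load into a private environment so the caller's workspace is untouched
  e <- new.env()
  # load() returns the names of the restored objects, not the data itself
  load("data/oceans.rda", envir = e)
  e$oceans
}
oceans() # Returns dataset as expected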
Note that version = 2 ensures your data object can be read by R versions prior to 3.5.0, i.e. it prevents this warning: WARNING: Added dependency on R >= 3.5.0 because serialized objects in serialize/load version 3 cannot be read in older versions of R.