Best practices for handling multidimensional (spatio-temporal) data with R

Original post: 2016-05-21 12:15:30. Tags: r, database, postgresql, gps, dplyr

I have a question regarding the use of a (Postgre)SQL database with R. Much of the documentation on this topic stresses that it only makes sense to use an SQL database from R if you are dealing with big data that doesn't fit in your RAM (e.g. see here and here). My situation is different, and I wasn't able to find out whether using a (Postgre)SQL database would be a reasonable decision. Here's my situation:

I'm well into an ecological research study in which I analyse roe deer GPS data recorded at different sampling intervals (5 min and 3 h) over a span of about 2 years. In addition, I integrate two-axis acceleration data recorded at a sampling interval of 4 minutes.

To evaluate the behaviour of the roe deer with regard to humans, I compare this multidimensional data with GPS data of human beings recorded at a sampling interval of 5 seconds.

To date, I've been doing this analysis using data frames / data tables with dplyr. When I merge all the data into one dataset, the resulting table becomes really wide. The columns include timestamp, ID, X/Y positions, DOP and so forth for both humans and roe deer, plus all the calculated values such as distance, speed, elevation, proximity and many more.

The data is also immensely long: the positions of multiple roe deer and multiple humans are recorded simultaneously (a many-to-many relationship), which leads to many repetitions in the data frame. On top of that, the different sampling intervals of humans and roe deer lead to further repetition (of the roe deer positions).

I'm hoping that with a database solution, I can

- write shorter, more elegant and concise code to analyse my data
- keep a better overview of my data, since it would be
  - shorter (no repetitions) and
  - narrower (separate tables for the individual datasets, with the corresponding relationships)

Would you recommend using a database in my case? Would a database solution help achieve the goals described above?

1 solution

PostgreSQL offers all the protections of an ACID-compliant database.

I use both R and PostgreSQL for work. To be honest, I prefer to keep most things in the database.

As for your many-to-many data join, database normalization may help you there.
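To make the normalization idea concrete, here is a minimal sketch in base R. It keeps deer fixes and human fixes in two narrow tables and builds the deer-human pairing only when it is needed. All table names, column names and values are made up for illustration, the 5-minute time bin is an assumed matching rule, and merge() merely stands in for an SQL JOIN that the database would run.

```r
# Hypothetical, made-up data: one narrow table per entity
# instead of a single wide, repetitive table.
deer_fixes <- data.frame(
  deer_id = c("D1", "D1", "D2"),
  ts = as.POSIXct(c("2015-06-01 12:00:00", "2015-06-01 12:05:00",
                    "2015-06-01 12:00:00"), tz = "UTC"),
  x = c(100, 110, 300),
  y = c(200, 205, 400)
)

human_fixes <- data.frame(
  human_id = c("H1", "H1", "H1"),
  ts = as.POSIXct(c("2015-06-01 12:00:00", "2015-06-01 12:00:05",
                    "2015-06-01 12:05:00"), tz = "UTC"),
  x = c(103, 104, 500),
  y = c(204, 204, 500)
)

# Assumed matching rule: fixes belong together when they fall into
# the same 5-minute (300 s) bin.
bin_5min <- function(ts) as.numeric(ts) %/% 300

deer_fixes$bin <- bin_5min(deer_fixes$ts)
human_fixes$bin <- bin_5min(human_fixes$ts)

# The many-to-many pairing is computed on demand, not stored.
pairs <- merge(deer_fixes, human_fixes, by = "bin",
               suffixes = c("_deer", "_human"))

# Derived values are calculated on the joined result only.
pairs$distance <- sqrt((pairs$x_deer - pairs$x_human)^2 +
                       (pairs$y_deer - pairs$y_human)^2)

print(pairs[, c("deer_id", "human_id", "ts_deer", "ts_human", "distance")])
```

The stored tables stay short and narrow; the repetition only appears in the temporary join result, which can be restricted to the individuals and time window of interest.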

Also, selecting only the relevant columns from PostgreSQL and applying a filter to the rows may help. More information on select queries can be found in the PostgreSQL select tutorial (ref: Postgresql select tutorial).

For example: select column1, column3 from example_table where x = y, and so on, then read the result into a data set.
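The select-and-filter step could look roughly like the sketch below. So that it runs without a database server, it uses an in-memory SQLite database through the DBI and RSQLite packages as a stand-in for PostgreSQL. The column column2 and all values are assumptions for illustration; example_table, column1 and column3 come from the query above.

```r
library(DBI)

# Stand-in for a PostgreSQL connection: an in-memory SQLite database.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical table with made-up values.
dbWriteTable(con, "example_table", data.frame(
  column1 = c(1, 2, 3),
  column2 = c("a", "b", "a"),
  column3 = c(10, 20, 30),
  stringsAsFactors = FALSE
))

# Only the needed columns and rows are transferred into R.
# The filter value is passed as a parameter instead of being pasted
# into the SQL string. Placeholder syntax depends on the driver:
# RSQLite accepts "?", while PostgreSQL drivers such as RPostgres
# use "$1" as far as I know.
subset_df <- dbGetQuery(
  con,
  "SELECT column1, column3 FROM example_table WHERE column2 = ?",
  params = list("a")
)

dbDisconnect(con)

print(subset_df)
```

The result is an ordinary data frame, so the rest of the analysis can continue in R with dplyr as before.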

A database is better suited to handling data, while R is better suited to data analysis.

If you want to take a look at the commands for calling PostgreSQL from R, you could look at this article from Google.

Ref: RPostgreSQL

Example

```
library(RPostgreSQL)

# Load the PostgreSQL driver
drv <- dbDriver("PostgreSQL")

# Open a connection
con <- dbConnect(drv, dbname = "R_Project")

# Submit a statement
rs <- dbSendQuery(con, "select * from R_Users")

# Fetch all rows of the result into a data frame, then clean up
users <- fetch(rs, n = -1)
dbClearResult(rs)
dbDisconnect(con)
```