Best practices for handling multidimensional (spatio-temporal) data with R
I have a question regarding the use of a (Postgre)SQL database with R. Much of the documentation on this topic stresses that using an SQL database from R only makes sense if you are dealing with big data that doesn't fit in your RAM (e.g. see here and here). I have a different situation and wasn't able to find out whether using a PostgreSQL database would be a reasonable decision. Here's my situation:
I'm well into an ecological research study in which I analyse roe deer GPS data at different sampling intervals (5 min and 3 h) over a span of about 2 years. In addition, I integrate two-axis acceleration data at a sampling interval of 4 minutes.
To evaluate the behaviour of the roe deer with regard to humans, I compare this multidimensional data with GPS data of human beings taken at a sampling interval of 5 seconds.
To date, I've been doing this analysis using data.frame/data.table with dplyr. When merging all the data into one dataset, the resulting table becomes really wide. The columns include timestamp, ID, X/Y position, DOP and so forth for both humans and roe deer, plus all the derived values such as distance, speed, elevation, proximity and many more.
Also, the data is immensely long: the positions of multiple roe deer and multiple humans are recorded simultaneously (a many-to-many relationship), which leads to many repetitions in the data frame. On top of that, the different sampling intervals of humans and roe deer lead to repetition (of the roe deer positions) as well.
I'm hoping that with a database solution, I can
Would you recommend using a database in my case? Would a database solution help achieve the goals described above?
PostgreSQL offers all the protections of an ACID-compliant database.
I use both R and PostgreSQL for work. To be honest, I prefer most things to be in the database.
Regarding your many-to-many data join, database normalization may help you there.
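As a rough illustration of what normalization could mean for this kind of data, here is a minimal base R sketch. The table layout, column names and values are made up for the example and are not taken from the actual study; the point is only that each deer fix and each human fix is stored once, and the pairing is rebuilt when it is needed.

```r
# Toy "wide" table (hypothetical columns and values): every deer fix is
# repeated once for each human fix it is compared with.
wide <- data.frame(
  deer_id  = c(1, 1, 1, 2, 2, 2),
  deer_ts  = c(0, 0, 0, 0, 0, 0),
  deer_x   = c(10, 10, 10, 50, 50, 50),
  deer_y   = c(20, 20, 20, 60, 60, 60),
  human_id = c(7, 7, 8, 7, 7, 8),
  human_ts = c(0, 5, 0, 0, 5, 0),
  human_x  = c(11, 12, 30, 11, 12, 30),
  human_y  = c(21, 22, 40, 21, 22, 40)
)

# Normalized form: one row per deer fix and one row per human fix.
deer_fixes  <- unique(wide[, c("deer_id", "deer_ts", "deer_x", "deer_y")])
human_fixes <- unique(wide[, c("human_id", "human_ts", "human_x", "human_y")])

# Rebuild the pairing only when needed. With no common key columns,
# merge() returns the Cartesian product (every deer fix x every human fix).
pairs <- merge(deer_fixes, human_fixes, by = NULL)

# Derived values are computed on demand instead of being stored in every row.
pairs$distance <- sqrt((pairs$deer_x - pairs$human_x)^2 +
                       (pairs$deer_y - pairs$human_y)^2)

print(nrow(deer_fixes))
print(nrow(human_fixes))
print(nrow(pairs))
```

In a database the same split would correspond to separate tables for deer fixes and human fixes, joined in a query, so the repeated deer positions never have to be stored.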
Also, selecting only the relevant columns from PostgreSQL and applying a filter to the rows may help. More information on select queries can be found here: Ref PostgreSQL select tutorial
E.g. SELECT column1, column3 FROM example_table WHERE x = y etc., then read the result into a data set.
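A query like the one above could be assembled in R roughly as follows. This is only a sketch: `build_fix_query`, the table name `deer_fixes` and the columns `ts`, `x`, `y` and `animal_id` are hypothetical names chosen for illustration, and the identifiers are assumed to come from trusted code rather than user input.

```r
# Build a SELECT statement that fetches only the needed columns and rows.
# All table and column names here are hypothetical.
build_fix_query <- function(columns, table, animal_id) {
  stopifnot(is.character(columns), length(columns) > 0,
            is.character(table), length(table) == 1,
            is.numeric(animal_id), length(animal_id) == 1)
  sprintf("SELECT %s FROM %s WHERE animal_id = %d",
          paste(columns, collapse = ", "), table, as.integer(animal_id))
}

sql <- build_fix_query(c("ts", "x", "y"), "deer_fixes", 3)
print(sql)

# With an open connection `con`, the result could then be read into a
# data frame, for example with:
# fixes <- dbGetQuery(con, sql)
```

Only the rows and columns needed for one step of the analysis are pulled into memory this way, while the full data stays in the database.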
A database is better suited to handling data, while R is better suited to data analysis.
If you want to take a look at the commands for calling PostgreSQL from R, you could look at this article from Google.
Ref RPostgreSQL
Example

```
library(RPostgreSQL)

# Load the PostgreSQL driver
drv <- dbDriver("PostgreSQL")

# Open a connection
con <- dbConnect(drv, dbname = "R_Project")

# Submit a statement
rs <- dbSendQuery(con, "select * from R_Users")

# Fetch all rows of the result into a data frame
users <- fetch(rs, n = -1)

# Release the result set and close the connection
dbClearResult(rs)
dbDisconnect(con)
```
All the best