
Efficient ways to store and process panel data in R

Suppose there are time series data for 1024 individuals stored in separate CSV files. I fread them into memory, obtaining 1024 data.frames of the following form:

Tables$Individual1

SampleDate,var1,var2,var3,...
2001-01-01,1001,2001,3001,...
2001-01-02,1002,2002,3002,...
2001-01-03,1004,2004,3004,...
...
2017-01-01,9999,9999,9999,...

Tables$Individual2

SampleDate,var1,var2,var3,...
1992-03-01,1101,2101,3101,...
1992-03-02,1102,2102,3102,...
1992-03-03,1104,2104,3104,...
...
2017-01-01,8888,8888,8888,...
... ...

The tables have different initial observation dates because the individuals have different dates of birth, but after that each day corresponds to one row, in order. If I use an array to store the combined data, many elements (days before birth) will be empty. What is the best way to organize the data in memory for quick access to cross-sectional data? For example, I want to fetch var1 and var3 on 2010-04-01 for all individuals who exist on that day. Currently I have to sapply a function that extracts part of each table, which is awfully slow.

Another matter: let's say I need to sort these individuals by a function f(var1,var2,var3,...) at 8 different dates. This is an embarrassingly parallel task, so I readily grabbed the parallel package, only to find that it takes forever to clusterExport those tables. Is there a clusterExport variant that uses shared memory, or should I switch to Linux to make fork clusters?

Any help will be appreciated.

Why not add a field with the individual's ID and put all the data into one data frame?

Take df1 as your Individual1 table and df2 as your Individual2 table; then

df1$IndID <- "01"

adds the individual ID to the data frame, which leads to

> df1
  SampleDate var1 var2 var3 IndID
1 2001-01-01 1001 2001 3001    01
2 2001-01-02 1002 2002 3002    01
3 2001-01-03 1004 2004 3004    01
4 2017-01-01 9999 9999 9999    01

The same for df2:

df2$IndID <- "02"

Then combine them into one data frame:

df <- rbind(df1,df2)

which leads to

> df
  SampleDate var1 var2 var3 IndID
1 2001-01-01 1001 2001 3001    01
2 2001-01-02 1002 2002 3002    01
3 2001-01-03 1004 2004 3004    01
4 2017-01-01 9999 9999 9999    01
5 1992-03-01 1101 2101 3101    02
6 1992-03-02 1102 2102 3102    02
7 1992-03-03 1104 2104 3104    02
8 2017-01-01 8888 8888 8888    02
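
With 1024 tables, tagging and binding them one by one is impractical. A minimal base R sketch of the same idea applied to a whole named list follows. It assumes the list is called Tables and that its names can serve as individual IDs, as in the question; the helper add_id and the toy values are made up for illustration.

```r
# Toy stand-in for the 1024 tables read with fread (values are illustrative).
Tables <- list(
  Individual1 = data.frame(SampleDate = c("2001-01-01", "2001-01-02"),
                           var1 = c(1001, 1002), var2 = c(2001, 2002),
                           var3 = c(3001, 3002), stringsAsFactors = FALSE),
  Individual2 = data.frame(SampleDate = c("1992-03-01", "1992-03-02"),
                           var1 = c(1101, 1102), var2 = c(2101, 2102),
                           var3 = c(3101, 3102), stringsAsFactors = FALSE)
)

# Hypothetical helper: tag one table with the ID of its individual.
add_id <- function(tbl, id) {
  tbl$IndID <- id
  tbl
}

# Tag every table with its list name, then stack them into one long data frame.
df <- do.call(rbind, Map(add_id, Tables, names(Tables)))
rownames(df) <- NULL  # drop the row names generated by rbind
```

Since the tables were read with fread, data.table is already available, and data.table::rbindlist(Tables, idcol = "IndID") produces the same long layout and is likely to be much faster on a list of this size. The long format also avoids the empty cells for days before birth, because only observed rows are stored.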

Handling the data is then easy and fast. For example, to fetch var1 and var3 as in your question:

> df[df$SampleDate=="2017-01-01", c("var1","var3")]
  var1 var3
4 9999 9999
8 8888 8888
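
The logical subset above scans every row on each lookup. If cross sections are fetched repeatedly, a keyed data.table may help, since the lookup becomes a binary search on the date. This is only a sketch under assumptions: the names Tables, panel and cross and the toy values are illustrative, and nomatch = NULL needs a reasonably recent data.table version.

```r
library(data.table)

# Toy stand-in for the per-individual tables (values are illustrative).
Tables <- list(
  Individual1 = data.frame(SampleDate = c("2010-03-31", "2010-04-01", "2010-04-02"),
                           var1 = c(1001, 1002, 1004), var3 = c(3001, 3002, 3004)),
  Individual2 = data.frame(SampleDate = c("2010-04-01", "2010-04-02"),
                           var1 = c(1102, 1104), var3 = c(3102, 3104)),
  # Born after the query date, so this individual must not appear in the result.
  Individual3 = data.frame(SampleDate = "2010-04-02", var1 = 1201, var3 = 3201)
)

# One long table; idcol stores each list name in a new IndID column.
panel <- rbindlist(Tables, idcol = "IndID")
panel[, SampleDate := as.IDate(SampleDate)]

# Sort by date once so that later lookups use binary search.
setkey(panel, SampleDate)

# Cross section for one day: only individuals with a row on that date are returned.
cross <- panel[.(as.IDate("2010-04-01")), .(IndID, var1, var3), nomatch = NULL]
```

Setting the key costs one sort up front, so this pays off mainly when many dates are queried against the same table.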

To sort the data, for example by IndID and then by var1, var2 and var3:

> library(dplyr)
> arrange(df, IndID, var1, var2, var3)
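
The question asks for a ranking by a function f(var1,var2,var3,...) on several dates rather than a sort on the raw columns. A rough base R sketch of that on the combined data frame follows. The function f, the helper rank_on and the toy values are placeholders, not anything from the original post.

```r
# Toy combined data frame in the long layout (values are illustrative).
df <- data.frame(
  SampleDate = rep(c("2017-01-01", "2017-01-02"), each = 3),
  IndID      = rep(c("01", "02", "03"), times = 2),
  var1       = c(3, 1, 2, 10, 30, 20),
  var2       = c(0, 0, 0, 0, 0, 0),
  var3       = c(1, 1, 1, 5, 5, 50),
  stringsAsFactors = FALSE
)

# Hypothetical scoring function; replace with the real f.
f <- function(var1, var2, var3) var1 + var2 - var3

# Dates on which a ranking is wanted (8 in the question, 2 here).
dates <- c("2017-01-01", "2017-01-02")

# Hypothetical helper: IDs present on date d, ordered by ascending score.
rank_on <- function(d) {
  day <- df[df$SampleDate == d, ]
  score <- f(day$var1, day$var2, day$var3)
  day$IndID[order(score)]
}

# One ranking per date, named by date.
rankings <- setNames(lapply(dates, rank_on), dates)
```

Each date's slice holds at most one row per individual, so the per-date work is small and may not need a cluster at all. If it does, parallel::mclapply could replace lapply on Linux or macOS, where workers are forked and nothing has to be exported; forking is not available on Windows.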

Partial answer:

lapply(Tables, '[[', 'var1')

This should return a list containing the var1 column for each individual. You may be able to pass further arguments to the extraction function to pull out only the required date values.
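
One way to restrict the extraction to a single date while keeping the list of tables is to pass a small function to lapply instead of '[['. This is a sketch only; the helper pick and the toy values are made up for illustration.

```r
# Toy stand-in for the list of per-individual tables (values are illustrative).
Tables <- list(
  Individual1 = data.frame(SampleDate = c("2010-03-31", "2010-04-01"),
                           var1 = c(1001, 1002), var3 = c(3001, 3002),
                           stringsAsFactors = FALSE),
  Individual2 = data.frame(SampleDate = c("2010-04-01", "2010-04-02"),
                           var1 = c(1101, 1102), var3 = c(3101, 3102),
                           stringsAsFactors = FALSE),
  # First observation is after the query date.
  Individual3 = data.frame(SampleDate = "2010-04-02", var1 = 1201, var3 = 3201,
                           stringsAsFactors = FALSE)
)

# Hypothetical helper: rows of one table on a given date, selected columns only.
pick <- function(tbl, date, cols) {
  tbl[tbl$SampleDate == date, cols, drop = FALSE]
}

# Extra arguments to lapply are forwarded to pick for every table.
rows <- lapply(Tables, pick, date = "2010-04-01", cols = c("var1", "var3"))

# Drop individuals with no observation on that date, then stack the rest.
rows <- rows[vapply(rows, nrow, integer(1)) > 0]
cross <- do.call(rbind, rows)
```

This still visits all tables on every query, so it is unlikely to be faster than the sapply approach described in the question; it mainly tidies the extraction.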
