R: How to quickly read large .dta files without RAM limitations

I have a 10 GB .dta Stata file and I am trying to read it into 64-bit R 3.3.1. I am working on a virtual machine with about 130 GB of RAM (4 TB HD), and the .dta file has about 3 million rows and somewhere between 400 and 800 variables.

I know data.table's fread() is the fastest way to read in .txt and .csv files, but does anyone have a recommendation for reading largeish .dta files into R? Reading the file into Stata as a .dta file takes about 20-30 seconds, although I need to set my working memory max prior to opening the file (I set the max at 100 GB).

I have not tried exporting to .csv from Stata, and I hope to avoid touching the file with Stata. One solution is described in "Using memisc to import stata .dta file into R", but it assumes RAM is scarce. In my case, I should have sufficient RAM to work with the file.

I recommend the haven R package. Unlike foreign, it can read the latest Stata formats:

library(haven)
data <- read_dta('myfile.dta')

Not sure how fast it is compared to other options, but your choices for reading Stata files in R are rather limited. My understanding is that haven wraps a C library, so it's probably your fastest option.
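If only some of the 400 to 800 variables are needed, limiting what gets read may also help. This is a minimal sketch, assuming a reasonably recent haven release in which read_dta() accepts the col_select and n_max arguments. The temporary file and the column names (id, income, age) are made up for illustration:

```r
library(haven)

# Write a small example file so the sketch is self-contained.
# The file and the columns id, income, and age are hypothetical placeholders.
tmp <- tempfile(fileext = ".dta")
write_dta(data.frame(id = 1:5, income = c(10, 20, 30, 40, 50), age = 21:25), tmp)

# Read only the first rows to inspect the variables without loading everything.
preview <- read_dta(tmp, n_max = 2)

# Read only the columns of interest.
wanted <- read_dta(tmp, col_select = c(id, income))
```

With plenty of RAM this is optional, but reading fewer columns means less data to parse.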

The fastest way to load a large Stata dataset in R is using the readstata13 package. I compared the performance of the foreign, readstata13, and haven packages on a large dataset in this post, and the results repeatedly showed that readstata13 is the fastest available package for reading Stata datasets in R.
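For anyone who wants to check this on their own machine, here is a minimal sketch of how such a comparison might be run. It assumes the readstata13 and haven packages are installed and uses a small synthetic data frame, not the dataset from the post, so the timings it prints say nothing about a 10 GB file:

```r
library(readstata13)
library(haven)

# Build a synthetic dataset and save it as a .dta file.
# The sizes are arbitrary; scale them up for a meaningful timing.
set.seed(1)
df <- data.frame(id = 1:10000, x = rnorm(10000), y = runif(10000))
tmp <- tempfile(fileext = ".dta")
save.dta13(df, tmp)

# Time each reader on the same file.
t_readstata13 <- system.time(d1 <- read.dta13(tmp))
t_haven <- system.time(d2 <- read_dta(tmp))

# Elapsed seconds per package.
print(c(readstata13 = t_readstata13[["elapsed"]], haven = t_haven[["elapsed"]]))
```

To benchmark the real file, replace tmp with its path and repeat each read a few times, since the first read may be affected by disk caching.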
