
Sample A CSV File Too Large To Load Into R?

I have a 3GB CSV file. It is too large to load into R on my computer. Instead, I would like to load a sample of the rows (say, 1000) without loading the full dataset.

Is this possible? I cannot seem to find an answer anywhere.

If you don't want to pay thousands of dollars to Revolution R so that you can load/analyze your data in one go, then sooner or later you will need to figure out a way to sample your data.

And that step is easier to do outside R.

(1) Linux Shell:

Assuming your data is in a consistent format, with one record per row, you can do:

sort -R data | head -n 1000 >data.sample

This randomly shuffles all the rows and writes the first 1000 into a separate file, data.sample. (Note that sort -R has to sort the entire 3GB file, and if the CSV has a header row, the header gets shuffled in with the data. Where GNU coreutils is available, shuf -n 1000 draws the sample in a single pass instead.)
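Either way, the sampled file is small enough to load into R normally. Here is a minimal sketch, assuming the file names from the example above and that the original file has a header row:

# Read just the first line of the big file to recover the column names
hdr <- names(read.csv("data", nrows = 1))
# Load the shuffled sample; its rows carry no header of their own
smp <- read.csv("data.sample", header = FALSE, col.names = hdr)
# The original header line may have landed in the sample; drop it if so
smp <- smp[smp[[1]] != hdr[1], ]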

(2) If the data is too large to fit into memory:

Another solution is to store the data in a database. For example, I have many tables stored in a MySQL database in a nice tabular format, and I can draw a sample with:

select * from tablename order by rand() limit 1000

You can easily communicate between MySQL and R using RMySQL, and you can index your columns to keep the queries fast. If you want to take advantage of the database's power, you can also have it compute the mean or standard deviation of the whole dataset and compare those against your sample.
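A minimal sketch of doing this from R with RMySQL; the connection details and the table/column names (tablename, value_col) are placeholders for your own setup:

library(RMySQL)

# Connect; host, user, password, and dbname are hypothetical placeholders
con <- dbConnect(MySQL(), host = "localhost", user = "me",
                 password = "secret", dbname = "mydb")

# Pull a random sample of 1000 rows straight into a data frame
smp <- dbGetQuery(con, "select * from tablename order by rand() limit 1000")

# Let the database compute full-dataset statistics to compare with the sample
full <- dbGetQuery(con, "select avg(value_col) as mu, std(value_col) as sd from tablename")

dbDisconnect(con)

Keep in mind that order by rand() scans and sorts the whole table, so it is fine for a one-off sample but slow if you run it repeatedly on a very large table.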

In my experience, these are the two most commonly used ways of dealing with 'big' data.

