Plotting huge data files in R?

I have an input file with about 20 million lines; the file is about 1.2 GB. Is there any way I can plot the data in R? Some of the columns hold categories, but most of them are numeric.

I have tried my plotting script on a small subset of the input file (about 800K lines), but even though I have about 8 GB of RAM, I don't seem to be able to plot all the data. Is there a simple way to do this?

Without a clearer description of the kind of plot you want, it is hard to give concrete suggestions. In general, however, there is no need to put 20 million points in a single plot. For example, a time series could be represented by a spline fit or by some kind of average, e.g. aggregating hourly data to daily averages. Alternatively, you can draw a subset of the data, e.g. only one point per day in the time series example. So I think your challenge is not so much getting 20M points, or even 800K, onto a plot, but aggregating your data effectively so that it conveys the message you want to tell.
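As a rough sketch of the two ideas above (aggregating to daily means, or plotting only a subset of rows), something like the following base R code could work. The data frame hourly and its columns time and value are made up for illustration; your real column names and time resolution will differ.

```r
# Hypothetical hourly series, simulated only for illustration
set.seed(1)
n_days <- 365
hours <- seq(as.POSIXct("2020-01-01 00:00", tz = "UTC"),
             by = "hour", length.out = n_days * 24)
hourly <- data.frame(time = hours, value = cumsum(rnorm(length(hours))))

# Option 1: aggregate hourly values to one mean per day
hourly$day <- as.Date(hourly$time, tz = "UTC")
daily <- aggregate(value ~ day, data = hourly, FUN = mean)

# Option 2: keep only a random subset of the rows (1000 here, an arbitrary choice)
idx <- sort(sample(nrow(hourly), 1000))
subset_rows <- hourly[idx, ]

# Plot the much smaller daily series instead of every hourly point
plot(daily$day, daily$value, type = "l", xlab = "Day", ylab = "Daily mean")
```

Which reduction is appropriate depends on the message of the plot: averaging smooths out short-lived spikes, while subsampling keeps raw values but may miss rare events.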

The hexbin package, which plots hexagonal bins instead of scatterplots for pairs of variables (as suggested by Ben Bolker in "Speed up plot() function for large dataset"), worked fairly well for me with 2 million records and 4 GB of RAM. But it failed for 200 million records/rows for the same set of variables. I tried reducing the bin size to trade computation time against RAM usage, but it did not help.

For 20 million records, you can try hexbins with xbins = 20, 30, or 40 to start with.
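A minimal sketch of that suggestion, assuming the hexbin package is installed (install.packages("hexbin")). The vectors x and y are simulated stand-ins for two numeric columns of your data, and n is kept small here so the example runs quickly.

```r
library(hexbin)

# Simulated stand-ins for two numeric columns (hypothetical data)
set.seed(42)
n <- 100000
x <- rnorm(n)
y <- x + rnorm(n)

# Bin the points at a few resolutions; fewer bins means coarser but cheaper plots
xbins_to_try <- c(20, 30, 40)
bins <- lapply(xbins_to_try, function(nb) hexbin(x, y, xbins = nb))

# Draw one of the binned plots; each hexagon is shaded by its point count
plot(bins[[2]], main = "hexbin, xbins = 30")
```

Only the per-hexagon counts are drawn, so the plotting cost depends on the number of bins rather than on the number of rows.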

Plotting directly into a raster file device (calling png(), for instance) is a lot faster. I tried plotting rnorm(100000) points: on my laptop the X11 cairo plot took 2.723 seconds, while the png device finished in 2.001 seconds. With 1 million points, the numbers are 27.095 and 19.954 seconds.

I use Fedora Linux, and here is the code.

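# Plot n random points to a PNG file (raster file device)
f = function(n) {
  x = rnorm(n)
  y = rnorm(n)
  png('test.png')
  plot(x, y)
  dev.off()
}

# Plot n random points to the default on-screen device (X11 cairo here)
g = function(n) {
  x = rnorm(n)
  y = rnorm(n)
  plot(x, y)
}

system.time(f(100000))
system.time(g(100000))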

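Increasing the memory with memory.limit() helped me... this was for plotting nearly 36K records with ggplot.

Does using memory.limit(size=2000) (or something larger) help to expand the available memory?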
