
Plotting a matrix “by parts” in R?

I have a 50k by 50k square matrix saved to disk in a text file, and I would like to produce a simple histogram to see the distribution of the values in the matrix.

Obviously, when I try to load the matrix in R using read.table(), I get a memory error because the matrix is too big. Is there any way I could load smaller submatrices one at a time but still produce a histogram that considers all the values of the original matrix? I can indeed load smaller submatrices, but each new one just overwrites the histogram I had for the previous submatrix.

Here's an approach. I don't have all the details because you did not provide sample data or the expected output, but one way to do this is through the read_csv_chunked function in the readr package. First, you will need to write your summarisation function and then apply it to each chunk. See below for a full reprex.


# Call the Required Libraries
library(dplyr)
library(ggplot2)
library(readr)

# First Generate Some Fake Data
temp <- tempfile(fileext = ".csv")

fake_dat <- as.data.frame(matrix(rnorm(1000*100), ncol = 100))
write_csv(fake_dat, temp)



# Now write a summarisation function
# This will be applied to each chunk that is read into
# memory. Note that only column V1 is binned here; values
# outside the breaks fall into an NA bin
summarise_for_hist <- function(x, pos){
  x %>% 
    mutate(added_bin = cut(V1, breaks = -6:6)) %>% 
    count(added_bin)
}

# Note that I manually set the cutpoints or "breaks"
# argument. You would need to refine this based on your
# data and subject matter expertise

# Apply the summary function to each chunk as it is read

small_read <- read_csv_chunked(temp, # data
                               DataFrameCallback$new(summarise_for_hist),
                               chunk_size = 200 # number of lines to read
                               )
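The callback above bins a single column, while the question asks about every value in the matrix. A minimal sketch of a callback that bins all columns of each chunk could look like the following. The names `count_all_values`, `hist_breaks`, `chunk_counts` and `total_counts` are hypothetical, and the breaks of -6 to 6 are only an assumption that suits the simulated standard-normal data.

```r
library(readr)

# Stand-in data: 1000 rows by 100 columns of standard-normal values
temp_all <- tempfile(fileext = ".csv")
write_csv(as.data.frame(matrix(rnorm(1000 * 100), ncol = 100)), temp_all)

# Assumed bin edges; adjust these to the range of the real data
hist_breaks <- -6:6

# Hypothetical callback: flatten the whole chunk and count values per bin.
# Values outside the breaks become NA and are dropped by table().
count_all_values <- function(x, pos) {
  values <- unlist(x, use.names = FALSE)
  bins <- cut(values, breaks = hist_breaks)
  as.data.frame(table(added_bin = bins), responseName = "n")
}

# One row per bin per chunk
chunk_counts <- read_csv_chunked(temp_all,
                                 DataFrameCallback$new(count_all_values),
                                 chunk_size = 200)

# Sum the per-chunk counts into one total per bin
total_counts <- aggregate(n ~ added_bin, data = chunk_counts, FUN = sum)
```

Because every chunk uses the same breaks, the per-chunk counts can simply be added together, which is what makes the chunked histogram equivalent to one computed on the full matrix.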

Now that we have summarised our data, we can combine and plot it.


# Generate our histogram by combining all of the results
# and plotting

small_read %>% 
  group_by(added_bin) %>% 
  summarise(total = sum(n)) %>% 
  ggplot(aes(added_bin, total))+
  geom_col()

This will yield the following:

[image: bar chart of the total count in each bin]
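If the file is a plain whitespace-separated text matrix, as in the question, the same idea can be sketched in base R without readr. This is an alternative to the approach above, not part of it: it reads a fixed number of rows at a time from an open connection with scan() and keeps a running count per bin. The file layout, the chunk size and the breaks are all assumptions, and `path`, `breaks`, `counts` and `rows_per_chunk` are hypothetical names.

```r
# Stand-in file: 500 rows by 20 columns, whitespace separated, no header
path <- tempfile(fileext = ".txt")
write.table(matrix(rnorm(500 * 20), ncol = 20), path,
            row.names = FALSE, col.names = FALSE)

breaks <- seq(-6, 6, by = 1)           # assumed fixed bin edges
counts <- numeric(length(breaks) - 1)  # running total per bin
rows_per_chunk <- 100                  # assumed number of rows per read

con <- file(path, open = "r")
repeat {
  # On an open connection, scan() resumes where the previous call stopped
  values <- scan(con, what = numeric(), nlines = rows_per_chunk, quiet = TRUE)
  if (length(values) == 0) break

  # Integer bin index for each value; values outside the breaks become NA
  idx <- cut(values, breaks = breaks, labels = FALSE)
  idx <- idx[!is.na(idx)]

  # Add this chunk's counts to the running totals
  counts <- counts + tabulate(idx, nbins = length(counts))
}
close(con)
```

The accumulated `counts` vector can then be drawn with, for example, `barplot(counts, names.arg = head(breaks, -1))`. If the range of the data is not known in advance, a first chunked pass that tracks the running minimum and maximum is one way to choose the breaks.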
