
Read, Transpose Big Matrix and Save

You have a very big matrix saved in a CSV file. You want to transpose it and save the result to another file, but you cannot load all the data into memory at once. How can you do it?

I think we can read a row from the file, transpose it into a column, and write that column to a file. Reading rows and transposing them into columns is fine for me, but I don't know how to write to a file column by column. Could anyone show an implementation?

Anyway I'll give you a hint:

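Let the original matrix have R rows and C columns, N = R x C elements in total, stored in row-major order. The element arr[or][oc] (or = old row, oc = old column) sits at the old location

ol = or x C + oc

It has to be moved to a new location nl in the transposed matrix, say A[nr][nc]. The transposed matrix has C rows and R columns, so in C/C++ terms

nl = nr x R + nc

Since nr = oc and nc = or, substituting these into nl gives

nl = oc x R + or --- [eq 1]

So,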

ol     = or x C     + oc
ol x R = or x C x R + oc x R
       = or x N     + oc x R    (from the fact R * C = N)
       = or x N     + (nl - or) --- from [eq 1]
       = or x (N-1) + nl

Rearranging,

nl = ol x R - or x (N-1)

Both nl and ol lie in the range 0..N-1, so taking both sides modulo (N-1) and using the properties of congruence, we get

nl mod (N-1) = (ol x R - or x (N-1)) mod (N-1)
             = (ol x R) mod (N-1) - (or x (N-1)) mod (N-1)
             = (ol x R) mod (N-1), since the second term evaluates to zero
nl = (ol x R) mod (N-1) for every ol < N-1, since nl is then also less than N-1

The last element (ol = N-1) is the one exception: it stays where it is, nl = N-1.
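
To sanity-check the result, here is a small Python sketch that compares the modular formula against the direct definition nl = oc x R + or for every element. The function names new_location and check are mine, not from the answer, and the last element (ol = N-1) is handled as a fixed point because the modulus N-1 cannot represent it.

```python
def new_location(ol: int, R: int, C: int) -> int:
    """Row-major index in an R x C matrix -> row-major index in its C x R transpose."""
    N = R * C
    if ol == N - 1:
        return ol  # the last element never moves (also covers the 1 x 1 case)
    return (ol * R) % (N - 1)


def check(R: int, C: int) -> bool:
    """Compare the modular formula with nl = oc * R + or for every element."""
    for r in range(R):
        for c in range(C):
            ol = r * C + c   # old location
            nl = c * R + r   # new location, straight from [eq 1]
            if new_location(ol, R, C) != nl:
                return False
    return True
```

For a 2 x 3 matrix, element 1 (row 0, column 1) ends up at index 2 of the transpose, and (1 x 2) mod 5 = 2 agrees with that.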

So now you can read one element at a time and put it at its correct position in the transposed matrix.
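
One way to "put an element at its position" on disk is to give every field the same width, so that slot nl starts at byte nl x width and can be reached with a seek. The following is only a sketch under several assumptions of mine: the matrix is rectangular, fields are UTF-8 text without NUL bytes, and a scratch file as large as R x C x width bytes is acceptable. The name transpose_csv is hypothetical.

```python
import csv
import os
import tempfile


def transpose_csv(src: str, dst: str) -> None:
    # Pass 1: find the shape (R rows, C columns) and the widest field in bytes.
    R = C = width = 0
    with open(src, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            R += 1
            C = max(C, len(row))
            for field in row:
                width = max(width, len(field.encode("utf-8")))
    width = max(width, 1)

    fd, scratch = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "r+b") as tmp:
            # Pass 2: write each field, NUL-padded, into slot nl = oc * R + or.
            with open(src, newline="", encoding="utf-8") as f:
                for r, row in enumerate(csv.reader(f)):
                    for c, field in enumerate(row):
                        tmp.seek((c * R + r) * width)
                        tmp.write(field.encode("utf-8").ljust(width, b"\0"))
            # Pass 3: the scratch file is now in transposed order; stream it out
            # one output row (R fields) at a time.
            tmp.seek(0)
            with open(dst, "w", newline="", encoding="utf-8") as out:
                writer = csv.writer(out)
                for _ in range(C):
                    chunk = tmp.read(R * width)
                    writer.writerow(
                        chunk[i * width:(i + 1) * width].rstrip(b"\0").decode("utf-8")
                        for i in range(R)
                    )
    finally:
        os.remove(scratch)
```

Only one row of the input or output is held in memory at a time. The price is one seek per element in the second pass, which can be slow on very large files.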

The program 'transpose' from https://github.com/micans/reaper may help here. It loads the matrix into memory as a single string, then writes the transposed result to file without building it in memory. Hence the memory use is limited to the uncompressed size of the matrix on disk. The program can read and write compressed data, and the row and cell separators are customisable (default '\n' and '\t'). In a simple test on a 60460 x 4671 matrix (compressed size 125M) it used about 20 times less memory than Python + pandas, and about 12 times less memory than R, and was approximately 13 times faster in both cases. A further upside is that no rounding or truncation of data happens: every field is copied as a sequence of bytes.
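
To illustrate the general idea of keeping the input as one string and streaming the transpose out, here is a rough Python sketch. It is not the reaper code and I have not checked how that program does it internally; write_transposed and its parameters are names I made up. It keeps the raw bytes plus one read cursor per row, and copies each field as bytes without parsing it.

```python
from typing import BinaryIO


def write_transposed(data: bytes, out: BinaryIO,
                     row_sep: bytes = b"\n", cell_sep: bytes = b"\t") -> None:
    if not data:
        return
    # Ignore one trailing row separator without copying the buffer.
    end = len(data) - len(row_sep) if data.endswith(row_sep) else len(data)

    # Record where every row starts and ends inside `data`.
    starts, ends = [], []
    pos = 0
    while pos <= end:
        nxt = data.find(row_sep, pos, end)
        if nxt == -1:
            nxt = end
        starts.append(pos)
        ends.append(nxt)
        pos = nxt + len(row_sep)

    # The column count is taken from the first row.
    ncols = data.count(cell_sep, starts[0], ends[0]) + 1
    cursors = starts  # current read position within each row

    # Each output row takes the next field from every input row.
    for _ in range(ncols):
        for i in range(len(cursors)):
            stop = data.find(cell_sep, cursors[i], ends[i])
            if stop == -1:
                stop = ends[i]
            if i:
                out.write(cell_sep)
            out.write(data[cursors[i]:stop])  # raw bytes, no parsing or rounding
            cursors[i] = stop + len(cell_sep)
        out.write(row_sep)
```

It can be used as write_transposed(open("in.tsv", "rb").read(), open("out.tsv", "wb")), where the file names are placeholders. Rows shorter than the first row simply contribute empty fields.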
