Read, Transpose Big Matrix and Save

Question

You have a very big matrix saved in a CSV file. You want to transpose it and save the result to another file, but you cannot load all the data into memory at once. How can you do it?

I think we can read a row from the file, transpose it into a column, and write that column to the output file. Reading rows and transposing them into columns is fine for me, but I don't know how to write to a file column by column. Could anyone show an implementation?

Answer 1

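Here is a hint. Treat the matrix as a flat, row-major array of N = R x C elements (R rows, C columns). The element arr[or][oc] sits at the original offset

ol = or x C + oc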

It has to move to a new offset nl in the transposed matrix A, where it becomes A[nr][nc]. The transposed matrix has C rows and R columns, so in C/C++ (row-major) terms

nl = nr x R + nc

Since nr = oc and nc = or, substituting these into nl gives

nl = oc x R + or    --- [eq 1]

So,

ol     = or x C     + oc
ol x R = or x C x R + oc x R
       = or x N     + oc x R    (from the fact R * C = N)
       = or x N     + (nl - or) --- from [eq 1]
       = or x (N-1) + nl

OR,

nl = ol x R - or x (N-1)

Both nl and ol lie in the range 0 to N-1, so taking both sides modulo (N-1) and using the properties of congruence, we get

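nl mod (N-1) = (ol x R - or x (N-1)) mod (N-1)
             = ((ol x R) mod (N-1) - (or x (N-1)) mod (N-1)) mod (N-1)
             = (ol x R) mod (N-1), since the second term evaluates to zero
nl = (ol x R) mod (N-1) for 0 <= ol < N-1, since nl is then less than N-1
nl = N-1 for ol = N-1 (the last element stays in place; the modulo would wrongly give 0)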

So you can read one element at a time and write it directly to its correct position in the transposed output.
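As a rough illustration of the formula above, here is a minimal Python sketch. It assumes the output is a seekable binary file of fixed-width 8-byte doubles, because a text CSV has variable-width cells and cannot be written at arbitrary positions. The names `transposed_offset` and `transpose_stream` are made up for this example and do not come from the answer.

```python
import struct

def transposed_offset(ol, rows, cols):
    """Map row-major offset `ol` in a rows x cols matrix to its offset in the transpose."""
    n = rows * cols
    if ol == n - 1:
        # The last element is a fixed point; the modulo formula would give 0 here.
        return ol
    return (ol * rows) % (n - 1)

def transpose_stream(values, rows, cols, out):
    """Write each value to its transposed slot in a seekable binary file.

    `values` is any iterable yielding the matrix in row-major order, so only
    one element needs to be held in memory at a time.
    """
    cell = struct.Struct("<d")  # assumption: every cell is a little-endian double
    for ol, value in enumerate(values):
        out.seek(transposed_offset(ol, rows, cols) * cell.size)
        out.write(cell.pack(value))
```

For a 2 x 3 matrix the offsets 0..5 map to 0, 2, 4, 1, 3, 5. Note that this does one seek per element, so it is likely to be slow on a real disk, and turning the binary result back into CSV would need a second streaming pass.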

Answer 2

The program 'transpose' from https://github.com/micans/reaper may help here. It loads the matrix into memory as a single string, then writes the transposed result to file without building the transposed matrix in memory. Hence the memory overhead is limited to the (uncompressed) size of the matrix on disk. The program can read and write compressed data, and the row and cell separators are customisable (defaults '\n' and '\t'). In a simple test on a 60460 x 4671 matrix (compressed size 125M) it used about 20 times less memory than Python + pandas and about 12 times less memory than R, and was approximately 13 times faster in both cases. A further advantage is that no rounding or truncation of data happens: every field is copied as a sequence of bytes.
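To illustrate the general idea (this is not the actual code of the 'transpose' program), here is a small Python sketch that keeps the raw bytes in memory and emits the transposed rows by walking one cursor per input row. The function name `write_transposed` is hypothetical, and the sketch assumes a rectangular matrix with no quoted fields that contain separators.

```python
def write_transposed(data, out, row_sep=b"\n", cell_sep=b"\t"):
    """Write the transpose of the delimited bytes `data` to the binary file `out`."""
    # cursors[i]: offset of the next unread cell in row i; ends[i]: offset where row i ends.
    cursors, ends = [], []
    pos = 0
    while pos < len(data):
        end = data.find(row_sep, pos)
        if end == -1:
            end = len(data)
        cursors.append(pos)
        ends.append(end)
        pos = end + len(row_sep)
    if not cursors:
        return
    # Each pass takes one cell from every input row and emits one output row.
    while cursors[0] <= ends[0]:
        cells = []
        for i in range(len(cursors)):
            stop = data.find(cell_sep, cursors[i], ends[i])
            if stop == -1:
                stop = ends[i]
            cells.append(data[cursors[i]:stop])  # copied as raw bytes, no parsing
            cursors[i] = stop + len(cell_sep)
        out.write(cell_sep.join(cells) + row_sep)
```

Beyond the input bytes, the extra memory is two integers per input row plus one output row at a time. For a comma-separated file you would pass `cell_sep=b","`.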
