
Generate CSV for table with a lot of data

I have a table in a Postgres database (AWS Redshift, actually). The data from this table needs to be exported to a CSV after some operations. As an example, consider a table Test with columns A, B, C, and D.

Column A, Column B, Column C, Column D
ValueA1 , ValueB1 , ValueC1 , 1
ValueA1 , ValueB2 , ValueC2 , 2

where A, B, C are strings and D is an integer.

Each row in this table means that, for the given combination of values of columns A, B, and C, D is the count.

The relationship between A, B, and C is hierarchical: A > B > C.

My requirement is that the CSV must contain the data corresponding to a Postgres ROLLUP operation, i.e. an example CSV:

Column A, Column B, Column C, Sum(D)
ValueA1 ,         ,         , 3
        , ValueB1 ,         , 1
        ,         , ValueC1 , 1
        , ValueB2 ,         , 2
        ,         , ValueC2 , 2

Currently, my approach is to do a GROUP BY on A, B, C and get the sum of column D; the hierarchical aggregation is then done in the application. I can't fetch the whole result set (70 million rows or so) in one go, but if I use LIMIT and OFFSET in Postgres to fetch the data in a paginated manner, there is a possibility that I split the hierarchical data, leading to the same value of A being seen twice (or more) in the CSV.
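For illustration, here is a minimal Python sketch of the hierarchical aggregation step described above (the real application is Java/jOOQ; names and the input shape are assumptions). It takes grouped rows sorted by (A, B, C) and emits the A-level, B-level, and C-level rows in the layout of the example CSV:

```python
from itertools import groupby
from operator import itemgetter

def rollup_rows(grouped):
    """grouped: list of (a, b, c, sum_d) tuples, pre-aggregated by
    GROUP BY A, B, C and sorted by (a, b, c).
    Returns CSV rows in the hierarchical layout from the question:
    one row per A with its total, then one per (A, B), then one per (A, B, C)."""
    out = []
    for a, a_group in groupby(grouped, key=itemgetter(0)):
        a_rows = list(a_group)
        out.append((a, "", "", sum(r[3] for r in a_rows)))        # A-level total
        for b, b_group in groupby(a_rows, key=itemgetter(1)):
            b_rows = list(b_group)
            out.append(("", b, "", sum(r[3] for r in b_rows)))    # B-level total
            for _, _, c, total in b_rows:
                out.append(("", "", c, total))                    # C-level leaf
    return out
```

On the two example rows, this yields the five CSV rows shown above (ValueA1 with total 3, then the B- and C-level rows). The same shape is what a server-side `GROUP BY ROLLUP (A, B, C)` would produce, just ordered differently.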

The application is built using Java and jOOQ. The data is sent to the frontend (built using React), and the CSV is written there.

Any help regarding how to get this CSV done is appreciated.

If I understand correctly, you'd like to make sure that each time you send a chunk of data, the chunk contains ALL rows for any value of Column A that is present in the chunk. You could use the DENSE_RANK window function like below:

SELECT *
FROM (
  SELECT
    ColumnA,
    ColumnB,
    ColumnC,
    DENSE_RANK() OVER (ORDER BY ColumnA ASC) AS dr,
    SUM(ColumnD) AS sumD
  FROM SomeTable
  GROUP BY ColumnA, ColumnB, ColumnC
) AS sub_table
WHERE sub_table.dr BETWEEN 1 AND 5

In the last condition, you can supply the range of rank values that you want in a chunk (DENSE_RANK() increments every time the value of ColumnA changes, so all rows for one ColumnA value share the same rank).
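To see why this kind of paging never splits a ColumnA group across chunks, here is a small Python sketch (names hypothetical, running in memory rather than against Redshift) that mimics the dense-rank assignment and the `BETWEEN lo AND hi` slicing:

```python
def dense_rank_chunks(rows, ranks_per_chunk):
    """rows: grouped result rows sorted by ColumnA, as (a, b, c, sum_d) tuples.
    Assigns a dense rank that increments whenever ColumnA changes, then cuts
    chunks on rank boundaries, so every ColumnA value lands wholly in one chunk."""
    ranked, dr, prev = [], 0, object()
    for r in rows:
        if r[0] != prev:              # new ColumnA value -> next rank
            dr, prev = dr + 1, r[0]
        ranked.append((dr, r))
    chunks, lo = [], 1
    while lo <= dr:
        hi = lo + ranks_per_chunk - 1
        # equivalent of: WHERE sub_table.dr BETWEEN lo AND hi
        chunks.append([r for d, r in ranked if lo <= d <= hi])
        lo = hi + 1
    return chunks
```

Note that the chunk boundary is a count of distinct ColumnA values, not a row count, so chunk sizes in rows will vary with how many (B, C) combinations each A value has.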

You could refer to:
https://docs.aws.amazon.com/redshift/latest/dg/r_WF_DENSE_RANK.html
https://docs.aws.amazon.com/redshift/latest/dg/r_Examples_of_dense_rank_WF.html
