
Generate CSV for table with a lot of data

I have a table in a Postgres database (AWS Redshift, actually), and the data from this table needs to be exported to a CSV after some operations. As an example, consider a table Test with columns A, B, C, D:

Column A, Column B, Column C, Column D
ValueA1 , ValueB1 , ValueC1 , 1
ValueA1 , ValueB2 , ValueC2 , 2     

where A, B, and C are strings and D is an integer.

An entry in this table means that, for the given combination of values in columns A, B, and C, column D holds the count.

The relationship between A, B, and C is hierarchical: A > B > C.

My requirement is that the CSV must contain data corresponding to a Postgres ROLLUP operation, i.e. the example CSV below (a query sketch follows it):

Column A, Column B, Column C, Sum(D)
ValueA1 ,         ,         , 3
        , ValueB1 ,         , 1
        ,         , ValueC1 , 1
        , ValueB2 ,         , 2
        ,         , ValueC2 , 2
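Since this layout corresponds to what SQL's ROLLUP produces, here is a minimal sketch of that aggregation, using the table name Test from above and the ColumnA..ColumnD naming for clarity, and assuming an engine that supports GROUP BY ROLLUP (PostgreSQL 9.5+; newer Redshift releases also support it). Note that ROLLUP repeats the parent values in child rows and returns NULL for the rolled-up columns, where the example above shows blanks:

SELECT ColumnA, ColumnB, ColumnC, SUM(ColumnD) AS sumD
FROM Test
GROUP BY ROLLUP (ColumnA, ColumnB, ColumnC)
-- keep the (A), (A,B) and (A,B,C) levels; drop the grand-total row
HAVING GROUPING(ColumnA) = 0
ORDER BY ColumnA, ColumnB NULLS FIRST, ColumnC NULLS FIRST;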

Currently, my approach is to do a GROUP BY on A, B, C and get the sum of column D; the hierarchical aggregation is done in the application. I can't get the whole result set (70 million rows or so) in one go, but if I use LIMIT and OFFSET in Postgres to fetch the data in a paginated manner, there is a possibility that I end up splitting the hierarchical data, leading to a value of column A being seen twice (or more) in the CSV, as sketched below.
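For concreteness, the paginated query being described looks roughly like this (hypothetical page size and offset); nothing ties a page boundary to a ColumnA boundary, which is where the split can happen:

-- Naive pagination: the page boundary falls at an arbitrary row, so the
-- rows for a single ColumnA value can straddle two consecutive pages.
SELECT ColumnA, ColumnB, ColumnC, SUM(ColumnD) AS sumD
FROM Test
GROUP BY ColumnA, ColumnB, ColumnC
ORDER BY ColumnA, ColumnB, ColumnC
LIMIT 100000 OFFSET 200000;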

The application is built using Java and jOOQ. The data is sent to the frontend (built using React), and the CSV is written there.

Any help regarding how to get this CSV done is appreciated.

If I understand correctly, you'd like to make sure that each time you send a chunk of data, the chunk contains ALL rows for any given value of ColumnA present in that chunk. You could use the DENSE_RANK window function, like below:

SELECT *
FROM (
  SELECT
    ColumnA,
    ColumnB,
    ColumnC,
    -- window functions are evaluated after GROUP BY, so dr ranks the
    -- distinct ColumnA values of the aggregated result
    DENSE_RANK() OVER (ORDER BY ColumnA ASC) AS dr,
    SUM(ColumnD) AS sumD
  FROM SomeTable
  GROUP BY ColumnA, ColumnB, ColumnC
) AS sub_table
WHERE sub_table.dr BETWEEN 1 AND 5

In the last condition, you can supply the range of record numbers that you want in a chunk; dense_rank() increments every time the value of ColumnA changes, so a rank range never splits the rows of a single ColumnA value across chunks.
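To size the paging loop, note that the highest rank dense_rank() assigns equals the number of distinct ColumnA values, so that count tells you how many chunks there are (a small sketch against the same SomeTable):

-- With a chunk size of 5 distinct ColumnA values, the loop runs
-- CEIL(n / 5.0) times, fetching dr BETWEEN 1 AND 5, then 6 AND 10, and so on.
SELECT COUNT(DISTINCT ColumnA) AS n
FROM SomeTable;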

You could refer to:
https://docs.aws.amazon.com/redshift/latest/dg/r_WF_DENSE_RANK.html
https://docs.aws.amazon.com/redshift/latest/dg/r_Examples_of_dense_rank_WF.html
