简体   繁体   中英

Strange behavior for java class CachedRowSetImpl

I have so difficulty to use the CachedRowSetImpl class in java. I want to analyse the data of a huge postgres table, that contains ~35,000,000 lines and 3 integer columns.

I cannot load everything into my computer physical memory, then I want to read these lines per batch of 100000 lines. When executing the corresponding query (select col1,col2,col3 from theTable limit 10000) in psql prompt or in a graphical interface such as pgadmin, it takes around 4000ms to load the 100000 lines and a few megabytes of memory.

I try to do the same operation with the following java code:

CachedRowSet rowset = new CachedRowSetImpl();
int pageSize=1000000;
rowset.setCommand("select pk_lib_scaf_a,pk_lib_scaf_b,similarity_evalue from from_to_scaf");
rowset.setPageSize(pageSize);
rowset.setReadOnly(true);
rowset.setFetchSize(pageSize);
rowset.setFetchDirection(ResultSet.FETCH_FORWARD);
rowset.execute(myConnection);

System.out.println("start !");

while (rowset.nextPage()) {
    while (rowset.next()) {
        //treatment of current data page

    } // End of inner while
    rowset.release();
} 

When running the above code, the "start !" message is never displayed in the console and the execution seems to be stuck in the rowset.execute() line. Moreover, the memory consumption gets crazy and reach the limit of my computer physical memory (8gb).

That's strange, it looks like the program tries to fill the rowset with the ~35,000,000 lines, without considering the pageSize configuration.

Does anybody experienced such problem with java JDBC and postgres drivers ? What do I miss ?

postgres 9.1 java jdk 1.7

From the CachedRowSet Javadoc (emphasis mine):

A CachedRowSet object is a disconnected rowset, which means that it makes use of a connection to its data source only briefly. It connects to its data source while it is reading data to populate itself with rows and again while it is propagating changes back to its underlying data source . The rest of the time, a CachedRowSet object is disconnected, including while its data is being modified.

From your question:

it looks like the program tries to fill the rowset with the ~35,000,000 lines, without considering the pageSize configuration

Yes, CachedRowSet will retrieve the 35m rows from your database at once, and after that it will apply the pagination and other defined properties. A possible solution would be processing the data by small chunks and having a flag on each row to mark it as processed.

I would recommend using an ETL tool like Pentaho that already handles this kind of problems.

In fact, the support of cursor is implicitly coded in the postgres JDBC as described in its documentation. A cursor is however created automatically with some conditions.

http://jdbc.postgresql.org/documentation/head/query.html

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM