
Transfer Huge data using Java

I have a requirement of transferring a huge amount of data (nearly 10 million records) from one database (Oracle) to another database (Postgres) using a Java program. I have done this by creating connections to the two databases, querying the data from the source, and then inserting it into the destination by iterating over the result set. But it is taking a huge amount of time to transfer the data. Is there any way to do the transfer more quickly?

One alternative would be to export all of the data in the table into a text file and then import that data directly into Postgres. I don't remember what export functionality Oracle has, but worst case you can always just use a query like this, dumping the contents of the table as a set of insert statements:

select 'insert into mynewtable values(' || old.a || ', ' || old.b || ...etc..|| ');'
from myoldtable old;
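
If you go the flat-file route, the PostgreSQL JDBC driver exposes the server's COPY command through its CopyManager API, which is generally much faster than row-by-row inserts. A minimal sketch, assuming a CSV export named export.csv and placeholder connection details:

import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;

import org.postgresql.copy.CopyManager;
import org.postgresql.core.BaseConnection;

public class CsvBulkLoad {
    public static void main(String[] args) throws Exception {
        // Placeholder URL, credentials, file, and table name -- adjust for your environment.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/targetdb", "user", "password");
             FileReader reader = new FileReader("export.csv")) {
            CopyManager copyManager = new CopyManager(conn.unwrap(BaseConnection.class));
            // COPY streams the whole file in a single server-side operation,
            // avoiding per-row statement and round-trip overhead.
            long rows = copyManager.copyIn(
                    "COPY mynewtable FROM STDIN WITH (FORMAT csv)", reader);
            System.out.println("Loaded " + rows + " rows");
        }
    }
}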

I've definitely processed 10MM records from an Oracle database (using Java) within a period of a few hours (with a lot of processing between each record). What kind of performance are you hoping for, and what are you getting now?

Do you have a lot of indexes or constraints being checked as you insert into the Postgres table? Maybe something else is wrong with your code? If you know that all of the rows are valid, maybe you should drop the constraints in the Postgres db while doing your inserts?
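
For example, a minimal sketch of dropping a non-essential index for the duration of the load and rebuilding it afterwards, assuming pgConn is an open connection to the target database (index and table names are placeholders):

try (Statement st = pgConn.createStatement()) {
    // Dropping the index lets Postgres skip index maintenance on every insert.
    st.execute("DROP INDEX IF EXISTS idx_mynewtable_a");

    // ... run the batched inserts here ...

    // Rebuild the index once, after all rows are in.
    st.execute("CREATE INDEX idx_mynewtable_a ON mynewtable (a)");
}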

Or, if you haven't in a while, maybe you need to vacuum the database?
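
If you want to do that from the Java side, keep in mind that VACUUM cannot run inside a transaction block, so autocommit has to be on. A one-off sketch with a placeholder table name, again assuming pgConn is an open connection:

pgConn.setAutoCommit(true);  // VACUUM must run outside a transaction block
try (Statement st = pgConn.createStatement()) {
    st.execute("VACUUM ANALYZE mynewtable");
}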

If you are limited to a single thread reading the data and then writing it, there's not a whole lot of room for improvement.

This type of performance is limited by a few different things: the amount of data you're moving across the wire, the speed of your network, database indexing and configuration, as well as some other things in the network / host.

At a minimum, you should be setting your read connection up with a larger fetch size.

ResultSet rs;
...
rs.setFetchSize(500); // pull 500 rows per network round trip instead of the driver default
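
The same setting can also go on the Statement before the query executes, so even the first fetch uses the larger size. A fragment sketch, assuming srcConn is an open connection to the source database and placeholder column/table names:

try (PreparedStatement stmt = srcConn.prepareStatement(
        "select a, b, c from source_table")) {
    stmt.setFetchSize(500);  // rows pulled per network round trip
    try (ResultSet rs = stmt.executeQuery()) {
        while (rs.next()) {
            // ... hand each row to the insert side ...
        }
    }
}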

On the insert side, you should also look at batching using a CallableStatement:

CallableStatement cs;
Connection conn;
conn.setAutoCommit(false);   // commit once per batch, not per row
...
cs.addBatch();               // queue the current row instead of executing it immediately
rowCount++;

if (rowCount % batchsize == 0) {
   int[] updateCounts = cs.executeBatch();

   conn.commit();

   // Inspect the per-statement results; log any row that did not insert.
   // (Some drivers return Statement.SUCCESS_NO_INFO, -2, for successful batched rows.)
   for (int i = 0; i < updateCounts.length; i++) {
        if (updateCounts[i] < 1)
           bad.write(batchRec[i]);   // bad/batchRec: the batch's source rows, kept for error logging
   }
}

There are other things you can do in Oracle for insert performance, one of which is setting up a bulk load using a named pipe; your process can then write to that named pipe. Named-pipe loads are non-logged operations, so they're pretty fast. I haven't done the named pipe thing from Java, so it's something to look into, but that should get you going.

You need to figure out where your bottleneck is. I have seen performance dwindle over time because the query is table scanning on some table, and it takes longer to retrieve data for the later rows than the earlier rows.

Like anything else, you need to start introducing timing, to see if your select starts to take more time, or if the read performance is pretty stable (later row fetches taking longer than earlier ones is a good indication of table scanning).
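
A minimal sketch of that kind of instrumentation, assuming rs is the ResultSet from the source query:

long blockStart = System.nanoTime();
int rows = 0;
while (rs.next()) {
    // ... process the row ...
    if (++rows % 100_000 == 0) {
        long elapsedMs = (System.nanoTime() - blockStart) / 1_000_000;
        // If later blocks take consistently longer, suspect table scanning.
        System.out.println("rows up to " + rows + ": " + elapsedMs + " ms");
        blockStart = System.nanoTime();
    }
}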

Lastly, if you can break the query down neatly, you can employ multiple worker threads to process the data in parallel, as in the sketch after the example queries below.

i.e. instead of

select a,b,c from source_table

You would break it down like this:

select a,b,c from source_table where a < 10;
select a,b,c from source_table where a >= 10 and a < 50;
select a,b,c from source_table where a >= 50;
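
A hedged sketch of the fan-out, where copySlice is a hypothetical method that runs one slice through the read-and-batch-insert loop above, each worker on its own pair of connections:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

ExecutorService pool = Executors.newFixedThreadPool(3);  // one thread per slice
List<String> slices = List.of(
        "select a,b,c from source_table where a < 10",
        "select a,b,c from source_table where a >= 10 and a < 50",
        "select a,b,c from source_table where a >= 50");
for (String slice : slices) {
    pool.submit(() -> copySlice(slice));  // copySlice: hypothetical per-slice worker
}
pool.shutdown();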

Like anything else, there are a hundred ways to do things.

The problem here is that programming languages use cursors to handle sets of tuples, and cursors can only iterate through them; you can't do bulk loading or anything like that. I think this is true for every programming language. A faster solution would be connecting Oracle to PostgreSQL somehow; I'm not sure how to do that, but I think it should be possible. There are Oracle functions for everything.
