
Suggestion on batch processing db records

I am working on developing a webapp (Visual JSF, EJB3, Hibernate on GlassFish/MySQL) that works with an existing legacy system.

I have an 'employee' table (with fields such as name (String), location (String), etc.) which is separate from an employee table on the legacy DB2 side. I do all of the webapp processing with my employee table. However, every week I need to schedule a task to go through all the employees in my table and compare them against the employees in the legacy DB2 table. If the employee location has changed in the legacy table, I need to update my employee table to reflect the new location.

What would you suggest as the best way to go about doing this?

Currently I am reading all the employees into an ArrayList and then looping through each employee entity in the list, getting the corresponding legacy employee instance, comparing locations, and updating my employee entity if a location change is detected.
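
The loop described above can be sketched roughly as follows. The names here are hypothetical stand-ins, and plain Maps stand in for the two tables, so only the comparison logic itself is shown:

```java
import java.util.HashMap;
import java.util.Map;

public class LocationSync {

    // Compare local locations against legacy locations and return the
    // updates that need to be applied (employee id -> new location).
    static Map<Integer, String> findChangedLocations(
            Map<Integer, String> local, Map<Integer, String> legacy) {
        Map<Integer, String> changes = new HashMap<>();
        for (Map.Entry<Integer, String> e : local.entrySet()) {
            String legacyLoc = legacy.get(e.getKey());
            // Update only when the legacy row exists and the location differs.
            if (legacyLoc != null && !legacyLoc.equals(e.getValue())) {
                changes.put(e.getKey(), legacyLoc);
            }
        }
        return changes;
    }

    public static void main(String[] args) {
        Map<Integer, String> local = new HashMap<>();
        local.put(1, "Boston");
        local.put(2, "Austin");
        Map<Integer, String> legacy = new HashMap<>();
        legacy.put(1, "Boston");
        legacy.put(2, "Denver");
        System.out.println(findChangedLocations(local, legacy)); // {2=Denver}
    }
}
```

In the real app the two maps would be replaced by the Hibernate entities and the legacy lookups, which is where the 5-minute cost comes from.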

Since I have close to 50000 records in my employee table, the initial build of the ArrayList takes around 5 minutes, and the number of employees will only keep increasing.

Is there a reason why it should be synced only once a week? If not, you might want to spread the operation over the week and do 1/7th of the work every day. You can also consider adding a table on your side to keep track of which record was synced when.
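
One simple way to pick each day's 1/7th, assuming numeric employee ids (the id-based bucketing is just an illustration, not something from the question; a tracking table keyed on last-sync time would work just as well):

```java
public class WeeklySpread {
    // An employee is due on a given day (0-6) if its id falls into that
    // day's bucket, so each day covers roughly 1/7 of the table.
    static boolean dueToday(int employeeId, int dayOfWeek) {
        return Math.floorMod(employeeId, 7) == dayOfWeek;
    }

    public static void main(String[] args) {
        int count = 0;
        for (int id = 1; id <= 50000; id++) {
            if (dueToday(id, 3)) count++;
        }
        System.out.println(count); // 7143, roughly 50000 / 7
    }
}
```

The same predicate translates directly into a WHERE clause (e.g. `MOD(id, 7) = ?`), so each nightly run only loads its own bucket.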

I would create a dblink (dblinks do exist on DB2, right?) and do something like:

 select
     a.id, a.location
 from
     empl a, empl@link b
 where
     a.id = b.id
     and a.location <> b.location

Then iterate the result set, which will contain all the employees whose location has changed.
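
Applying those rows is then a straight pass over the diff. In this sketch a small record class stands in for the JDBC result set rows and a Map stands in for the local table (both hypothetical names, just to show the shape of the loop):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ApplyChanges {
    // One row of the dblink query: an id whose legacy location differs.
    record ChangedRow(int id, String newLocation) {}

    // Apply each changed row to the local table; note we only touch the
    // rows the query returned, never the unchanged majority.
    static void applyAll(List<ChangedRow> rows, Map<Integer, String> localTable) {
        for (ChangedRow row : rows) {
            localTable.put(row.id(), row.newLocation());
        }
    }

    public static void main(String[] args) {
        Map<Integer, String> local = new HashMap<>();
        local.put(7, "Boston");
        applyAll(List.of(new ChangedRow(7, "Denver")), local);
        System.out.println(local.get(7)); // Denver
    }
}
```

With real JDBC this loop would instead execute one `UPDATE ... WHERE id = ?` per row, ideally via `addBatch()`/`executeBatch()`.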

If you have the ability to alter the legacy table in any way, you could add a needs_sync column to it. Then, using a trigger or by modifying the code that updates the location, set needs_sync = 1 when you do the update. (Add an index on that column, too.)

Then, to find records to update:

select id, location
from legacy.employee
where needs_sync = 1

When you've successfully done the sync:

update employee
set needs_sync = 0
where needs_sync = 1

Do it all in a transaction to avoid a race condition.

This solution has the advantage of only examining records which have been changed, so it will be efficient at runtime. It does require a change to the legacy schema, which might be painful or impossible to do.

I'm thinking of using the JPA query's setMaxResults() and setFirstResult() methods to retrieve employee data in chunks. These methods are used for paginating display data in the UI, so I don't see any reason why I can't do the same here. This way I can process a chunk at a time. And I could probably add a queue and an MDB for processing the chunks in parallel, since I can't create threads within the EJB container.

I am thinking of using JMS messages, queues, and MDBs to try to solve this problem instead. I would send each employee record as a separate message to a queue, and then the corresponding MDB can do all the processing and updating for that record. I am thinking I might get more simultaneous processing done that way.
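
Inside the container this fan-out would be a JMS queue plus an MDB's `onMessage()`; this standalone sketch uses `java.util.concurrent` instead, purely to show the same pattern of one message per employee record consumed by a pool of workers (the counter stands in for the per-record compare-and-update):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class QueueFanOut {
    // Process ids 1..total, one "message" per id, across nWorkers consumers.
    static int processRecords(int total, int nWorkers) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(nWorkers);
        AtomicInteger updated = new AtomicInteger();
        for (int id = 1; id <= total; id++) {
            final int employeeId = id;
            // In the container this submit is a JMS send; the MDB's
            // onMessage() would do the compare-and-update for employeeId.
            pool.submit(() -> updated.incrementAndGet());
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return updated.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(processRecords(1000, 8)); // 1000
    }
}
```

One message per record gives the most parallelism but also 50000 sends per run; batching several ids per message is a common middle ground.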
