
Improve performance of loading 100,000 records from database

We created a program to make it easier for other programs to use the database, so the code shown here is used in several other programs.

One of those programs receives about 10,000 records from one of our clients and has to check whether they are already in our database. If not, we insert them into the database (they can also change and then have to be updated).

To make this easy, we load every entry from the whole table (currently 120,000 of them), create a class instance for each entry we get, and put all of them into a HashMap.

Loading the whole table this way takes around 5 minutes. We also sometimes have to restart the program because we run into a GC overhead error, since we work on limited hardware. Do you have any ideas on how we can improve the performance?

Here is the code that loads all entries (we have a global limit of 10,000 entries per query, so we use a loop):

public Map<String, IMasterDataSet> getAllInformationObjects(ISession session) throws MasterDataException {
    IQueryExpression qe;
    IQueryParameter qp;
    
    // our main SDP class
    Constructor<?> constructorForSDPbaseClass = getStandardConstructor();
    
    SimpleDateFormat itaTimestampFormat = new SimpleDateFormat("yyyyMMddHHmmssSSS");
    
    // search in standard time range (modification date!)
    Calendar cal = Calendar.getInstance();
    cal.set(2010, Calendar.JANUARY, 1);
    Date startDate = cal.getTime();
    Date endDate = new Date();
    Long startDateL = Long.parseLong(itaTimestampFormat.format(startDate));
    Long endDateL = Long.parseLong(itaTimestampFormat.format(endDate));

    IDescriptor modDesc = IBVRIDescriptor.ModificationDate.getDescriptor(session);

    // count once before to determine initial capacities for hash map/set
    IBVRIArchiveClass SDP_ARCHIVECLASS = getMasterDataPropertyBag().getSDP_ARCHIVECLASS();
    qe = SDP_ARCHIVECLASS.getQueryExpression(session);
    qp = session.getDocumentServer().getClassFactory()
            .getQueryParameterInstance(session, new String[] {SDP_ARCHIVECLASS.getDatabaseName(session)}, null, null);        
    qp.setExpression(qe);  
    qp.setHitLimitThreshold(0);
    qp.setHitLimit(0);
    int nrOfHitsTotal = session.getDocumentServer().queryCount(session, qp, "*");
    int initialCapacity = (int) (nrOfHitsTotal / 0.75 + 1);

    // MD sets; and objects already done (here: document ID)
    HashSet<String> objDone = new HashSet<>(initialCapacity); 
    HashMap<String, IMasterDataSet> objRes = new HashMap<>(initialCapacity); 
    
    qp.close();
    
    // do queries until the hit count is smaller than 10,000
    // use modification date
    
    boolean keepGoing = true;
    while(keepGoing) {
        // construct query expression
        // - basic part: Modification date & class type
        // a. doc. class type
        qe = SDP_ARCHIVECLASS.getQueryExpression(session);
        // b. modification date range
        qe = SearchUtil.appendQueryExpressionWithANDoperator(session, qe, 
                   new PlainExpression(modDesc.getQueryLiteral() + " BETWEEN " + startDateL + " AND " + endDateL));
        
        // 2. Query Parameter: set database; set expression
        qp = session.getDocumentServer().getClassFactory()
                .getQueryParameterInstance(session, new String[] {SDP_ARCHIVECLASS.getDatabaseName(session)}, null, null);
        
        qp.setExpression(qe);  
        
        // order by modification date; hitlimit = 0 -> no hitlimit, but the usual 10.000 max
        qp.setOrderByExpression(session.getDocumentServer().getClassFactory().getOrderByExpressionInstance(modDesc, true));
        qp.setHitLimitThreshold(0);
        qp.setHitLimit(0);

        // Do not sort by modification date;
        qp.setHints("+NoDefaultOrderBy");
        
        keepGoing = false;
        IInformationObject[] hits = null;
        IDocumentHitList hitList = null;
        hitList = session.getDocumentServer().query(qp, session);
        IDocument doc;
        if (hitList.getTotalHitCount() > 0) {
            hits = hitList.getInformationObjects();
            for (IInformationObject hit : hits) {
                String objID = hit.getID();
                if(!objDone.contains(objID)) {
                    // do something with this object and the class
                    // here: construct a new SDP sub class object and give it back via interface
                    doc = (IDocument) hit;
                    IMasterDataSet mdSet;
                    try {
                        mdSet = (IMasterDataSet) constructorForSDPbaseClass.newInstance(session, doc);
                    } catch (Exception e) {
                        // extract the underlying cause for the error message
                        String cause = (e.getCause() != null) ? e.getCause().toString() : MasterDataException.ERRMSG_PART_UNKNOWN;                            
                        throw new MasterDataException(MasterDataException.ERRMSG_NOINSTANCE_POSSIBLE, this.getClass().getSimpleName(), e.toString(), cause);
                    }                        
                    objRes.put(mdSet.getID(), mdSet);
                    objDone.add(objID);
                }                       
            }
            doc = (IDocument) hits[hits.length - 1];
            Date lastModDate = ((IDateValue) doc.getDescriptor(modDesc).getValues()[0]).getValue();
            startDateL = Long.parseLong(itaTimestampFormat.format(lastModDate));
        
            keepGoing = (hits.length >= 10000 || hitList.isResultSetTruncated());
        }
        qp.close();
    }   
    return objRes;
}

Loading 120,000 rows (and more) each time will not scale well, and your solution may stop working in the future as the table grows. Instead, let the database server handle the problem.

Your table needs a primary key or unique key based on the columns of the records. Iterate through the 10,000 incoming records and perform a JDBC SQL UPDATE for each one, modifying all field values with a WHERE clause that exactly matches the primary/unique key.

update BLAH set COL1 = ?, COL2 = ? where PKCOL = ?; -- ... AND PKCOL2 = ? ...

This either modifies an existing row or does nothing at all, and JDBC executeUpdate() will return 0 or 1 to indicate the number of rows changed. If the number of rows changed is zero, you have detected a new record that does not yet exist, so perform an INSERT for that new record only.

insert into BLAH (COL1, COL2, ... PKCOL) values (?,?, ..., ?);
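
For illustration, a minimal JDBC sketch of that per-record update-then-insert flow might look like the following. The table BLAH, the columns COL1/COL2/PKCOL and the ClientRecord holder are placeholders taken from the statements above, not your real schema:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// hypothetical holder for one incoming client record
class ClientRecord {
    final String pk, col1, col2;
    ClientRecord(String pk, String col1, String col2) { this.pk = pk; this.col1 = col1; this.col2 = col2; }
}

public class UpsertExample {
    static void upsertAll(Connection con, List<ClientRecord> records) throws SQLException {
        try (PreparedStatement update = con.prepareStatement(
                 "UPDATE BLAH SET COL1 = ?, COL2 = ? WHERE PKCOL = ?");
             PreparedStatement insert = con.prepareStatement(
                 "INSERT INTO BLAH (COL1, COL2, PKCOL) VALUES (?, ?, ?)")) {
            for (ClientRecord r : records) {
                update.setString(1, r.col1);
                update.setString(2, r.col2);
                update.setString(3, r.pk);
                // executeUpdate() returns the number of rows changed: 1 = existing row updated
                if (update.executeUpdate() == 0) {
                    // 0 = no row matched the key, so this record is new: insert it
                    insert.setString(1, r.col1);
                    insert.setString(2, r.col2);
                    insert.setString(3, r.pk);
                    insert.executeUpdate();
                }
            }
        }
    }
}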

You can decide whether to run all 10,000 updates first and then however many inserts are needed, or to do an update plus an optional insert per record. Also remember that JDBC batch statements and turning auto-commit off may help speed things up, as sketched below.
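
To sketch that last point, assuming the same placeholder table, columns and ClientRecord holder as in the previous sketch: the updates are sent as one JDBC batch with auto-commit turned off, and the records whose update changed no rows are inserted afterwards in a second batch.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class BatchUpsertExample {
    static void upsertBatched(Connection con, List<ClientRecord> records) throws SQLException {
        boolean oldAutoCommit = con.getAutoCommit();
        con.setAutoCommit(false); // commit once at the end instead of after every statement
        try (PreparedStatement update = con.prepareStatement(
                 "UPDATE BLAH SET COL1 = ?, COL2 = ? WHERE PKCOL = ?");
             PreparedStatement insert = con.prepareStatement(
                 "INSERT INTO BLAH (COL1, COL2, PKCOL) VALUES (?, ?, ?)")) {

            // 1. batch all updates
            for (ClientRecord r : records) {
                update.setString(1, r.col1);
                update.setString(2, r.col2);
                update.setString(3, r.pk);
                update.addBatch();
            }
            // one update count per record, in the same order
            // (some drivers may report Statement.SUCCESS_NO_INFO instead of exact counts)
            int[] updated = update.executeBatch();

            // 2. batch inserts for every record whose update changed 0 rows
            List<ClientRecord> newRecords = new ArrayList<>();
            for (int i = 0; i < updated.length; i++) {
                if (updated[i] == 0) {
                    newRecords.add(records.get(i));
                }
            }
            for (ClientRecord r : newRecords) {
                insert.setString(1, r.col1);
                insert.setString(2, r.col2);
                insert.setString(3, r.pk);
                insert.addBatch();
            }
            insert.executeBatch();

            con.commit();
        } catch (SQLException e) {
            con.rollback();
            throw e;
        } finally {
            con.setAutoCommit(oldAutoCommit);
        }
    }
}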
