Java framework, application, or server for deduplication with scoring against a database input staging table

Please suggest a Java product (preferably open-source) that provides:

  1. data deduplication
  2. deduplication scoring
  3. customizable deduplication rules and scoring rules.

Please see the example:

  1. I have an input staging database named "INPUT_DB"
  2. I have a table named "INPUT_PERSONS"
  3. There are several fields in this table:

    ID (some meaningless surrogate primary key)
    FIRST_NAME
    LAST_NAME
    SECOND_NAME
    BIRTH_DATE
    PASSPORT_SERIES (PASSPORT_SERIES + PASSPORT_NUM is a unique identifier of a citizen)
    PASSPORT_NUM

I have to scan all records in INPUT_PERSONS and find duplicates and possible matches. Several rules should be defined:

  1. If PASSPORT_SERIES+PASSPORT_NUM of one record equals that of another record, the two records are duplicates. The score in this case is 100 out of 100.
  2. If FIRST_NAME and LAST_NAME are equal, but PASSPORT_SERIES+PASSPORT_NUM differs by one character (a misprint, for example), the records are possible duplicates and their score is 90 out of 100.
  3. And so on...
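
To make the two rules concrete, here is a minimal Java sketch of how they might be scored by hand. The `Person` and `DedupScorer` classes and their method names are hypothetical, invented for illustration only. The sketch assumes that "one different character" means a single substituted character in strings of equal length, that all fields are non-null, and that a pair matching neither rule scores 0.

```java
/** Hypothetical holder for a few INPUT_PERSONS columns (not part of any library). */
final class Person {
    final String firstName;
    final String lastName;
    final String passportSeries;
    final String passportNum;

    Person(String firstName, String lastName, String passportSeries, String passportNum) {
        this.firstName = firstName;
        this.lastName = lastName;
        this.passportSeries = passportSeries;
        this.passportNum = passportNum;
    }

    /** PASSPORT_SERIES + PASSPORT_NUM, the citizen identifier from the question. */
    String passport() {
        return passportSeries + passportNum;
    }
}

/** Hypothetical scorer implementing rules 1 and 2 only. */
final class DedupScorer {
    /** Returns a score out of 100 for a pair of records. */
    static int score(Person a, Person b) {
        String pa = a.passport();
        String pb = b.passport();
        // Rule 1: identical passport identifier means a certain duplicate.
        if (pa.equals(pb)) {
            return 100;
        }
        // Rule 2: same first and last name, passport off by one character.
        boolean sameName = a.firstName.equals(b.firstName)
                && a.lastName.equals(b.lastName);
        if (sameName && differsByOneChar(pa, pb)) {
            return 90;
        }
        // Assumption: no other rule applies, so the pair is not a match.
        return 0;
    }

    /** True when both strings have equal length and exactly one position differs. */
    static boolean differsByOneChar(String x, String y) {
        if (x.length() != y.length()) {
            return false;
        }
        int diff = 0;
        for (int i = 0; i < x.length(); i++) {
            if (x.charAt(i) != y.charAt(i)) {
                diff++;
            }
        }
        return diff == 1;
    }
}
```

Comparing every pair this way is quadratic in the number of rows, which is one reason a ready-made record linkage tool may be preferable for a large staging table.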

Is there a ready-made solution that I could use as a base?

I've done this in the past and based it on the Fellegi-Sunter algorithm. See this question: Is there an open source implementation for Fellegi-Sunter?
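
For readers unfamiliar with it, the core of Fellegi-Sunter scoring is a sum of per-field log-likelihood weights. The following is a rough Java sketch of that weight calculation only, not the implementation mentioned in this answer. The class name is hypothetical, and the `m` values (probability that a field agrees for true matches) and `u` values (probability that it agrees for non-matches) have to be estimated from your own data.

```java
/** Hypothetical sketch of Fellegi-Sunter match weights (not a library API). */
final class FellegiSunterSketch {
    /** Weight added when a field agrees: log2(m / u). */
    static double agreementWeight(double m, double u) {
        return log2(m / u);
    }

    /** Weight added when a field disagrees: log2((1 - m) / (1 - u)). */
    static double disagreementWeight(double m, double u) {
        return log2((1 - m) / (1 - u));
    }

    /**
     * Total match weight for a record pair.
     * agrees[i] says whether field i agrees; m[i] and u[i] are that field's probabilities.
     */
    static double matchWeight(boolean[] agrees, double[] m, double[] u) {
        double total = 0.0;
        for (int i = 0; i < agrees.length; i++) {
            total += agrees[i]
                    ? agreementWeight(m[i], u[i])
                    : disagreementWeight(m[i], u[i]);
        }
        return total;
    }

    private static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }
}
```

A pair is then classified as a match, a possible match, or a non-match by comparing the total weight against two thresholds chosen for the data set. Unlike the fixed 100 and 90 scores in the question, these weights are not bounded to a 0 to 100 scale, so they would need rescaling if such a scale is required.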

The Duke project can meet your requirements: https://github.com/larsga/Duke
