Find matching records with least characters from Pattern - Oracle / Java

The web application I am currently working on has file import logic. The logic

1> reads the records from a file [Excel or txt],
2> shows a non-editable grid of all the imported records [records that do not exist in the database are marked as New, and existing records are marked as Update], and
3> writes the records to the database.

Consider a file of contacts in the following format (it mirrors the columns in the database, where First_Name and Last_Name form the primary key):

First_Name, Last_Name, AddressLine1, AddressLine2, City, State, Zipcode

The issue we are running into arises when the file contains different values for the same entity. For example, someone might type NY for New York while others enter New York. The same applies to first and last names: John Myers and John Myer refer to the same person, but because the record does not match exactly, the system inserts a new record rather than updating the existing one.

For example, given this record from the file (please note that any resemblance of the name and address to real ones is purely coincidental :) ):

John, Myers, 44 Chestnut Hill, Apt 5, Indiana, Indiana, 11111

and the record in the database:

John, Myer, 80 Washington St, Apt 1, Chicago, IL, 3333

the system should have detected the record in the file as an existing record [the first name matches exactly, and the last names Myers and Myer differ by a single letter] and updated the address, but instead it inserts a new record.

How can I approach this issue so that, for each record in the file, I can find all the existing records in the database that approximately match it?

This is a very difficult problem to solve. If you know the sources of your data, you could attempt to manually reconcile the different combinations of data input.

Otherwise, you could try phonetic data-cleaning solutions.
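The answer does not name a specific phonetic algorithm, so the following is only a minimal sketch that assumes classic Soundex as one possible choice. The class name `SimpleSoundex` and its `encode` method are hypothetical names made up for illustration. The idea would be to compare the phonetic code of a name from the file against the codes of names already in the database instead of comparing the raw strings.

```java
import java.util.Locale;

class SimpleSoundex {
    // Soundex digit for each letter A..Z; '0' means the letter produces no digit.
    private static final String CODES = "01230120022455012623010202";

    // Returns a 4-character code: the first letter followed by three digits.
    static String encode(String name) {
        StringBuilder out = new StringBuilder();
        char last = '0';
        for (char raw : name.toUpperCase(Locale.ROOT).toCharArray()) {
            if (raw < 'A' || raw > 'Z') {
                continue; // ignore spaces, punctuation and non-ASCII characters
            }
            char code = CODES.charAt(raw - 'A');
            if (out.length() == 0) {
                out.append(raw); // the first letter is kept as-is
                last = code;
                continue;
            }
            if (raw == 'H' || raw == 'W') {
                continue; // H and W do not separate letters with the same digit
            }
            if (code != '0' && code != last) {
                out.append(code);
            }
            last = code; // a vowel resets the previous digit
        }
        if (out.length() == 0) {
            return "";
        }
        while (out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.substring(0, 4);
    }

    public static void main(String[] args) {
        System.out.println(encode("Smith") + " " + encode("Smyth")); // prints "S530 S530"
        System.out.println(encode("Myers") + " " + encode("Myer"));  // prints "M620 M600"
    }
}
```

Note that under this sketch Myers and Myer still receive different codes, because the trailing s contributes a digit. A phonetic key of this kind catches spelling variants such as Smith and Smyth, but on its own it would not cover the example from the question.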

One solution I could think of is using regular expressions in Oracle to achieve this functionality to some extent.

For each column, I would generate a regular expression that keeps the first half of the string literal and makes every remaining character optional. For example, for the name "Myers" in the file and "Myer" in the database, the following query would work:

SELECT Last_Name FROM Contacts WHERE (Last_Name IS NULL OR Regexp_Like(Last_Name, '^Mye?r?s?$'))

I consider this a partial solution: I would parse the input string and append the zero-or-one operator (?) to each character from the midpoint to the end of the string, hoping that the input string is not too badly mangled.
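The pattern generation described above could be sketched in Java roughly as follows. This is an illustration under assumptions, not a tested import routine: the class name `HalfOptionalRegex` and the method `build` are hypothetical, the midpoint is taken as the string length divided by two (rounded down), and regex metacharacters in the input are escaped with a backslash so that a value such as "St." is treated literally.

```java
import java.util.regex.Pattern;

class HalfOptionalRegex {
    // Characters that have a special meaning in a regular expression.
    private static final String META = "\\.^$*+?()[]{}|";

    // Keeps the first half of the value literal and makes each later character optional.
    static String build(String value) {
        int keep = value.length() / 2; // assumption: midpoint rounded down
        StringBuilder sb = new StringBuilder("^");
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (META.indexOf(c) >= 0) {
                sb.append('\\'); // escape metacharacters so they match literally
            }
            sb.append(c);
            if (i >= keep) {
                sb.append('?'); // zero-or-one operator from the midpoint onward
            }
        }
        return sb.append('$').toString();
    }

    public static void main(String[] args) {
        String pattern = build("Myers");
        System.out.println(pattern);                          // prints ^Mye?r?s?$
        System.out.println(Pattern.matches(pattern, "Myer")); // prints true
    }
}
```

The resulting string could then be passed as the pattern argument of Regexp_Like, preferably as a bind parameter rather than concatenated into the SQL text. The pattern is quite permissive (the one for Myers also matches My and Mys), and it only tolerates missing characters in the second half of the stored value, not extra or substituted ones.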

I am hoping to get some feedback from others on SO about this "solution".
