
Data matching Algorithm Approach

I don't really know where to start with this project, and so I'm hoping a broad question can at least point me in the right direction.

I have 2 data sets right now, each about 5 GB with 2 million observations. They are the assessor and historical data gathered for property listings in a given area over a certain period of time. What I need to do is match properties to one another. A property may appear in the historical data more than once, since it can be sold 2 or 3 times during the period. In the historical data I have the seller info, the loan info, and the sale info. In the assessor data I have all of the characteristics that describe the property sold. So in order to build any pricing model, I need to match the two.

I have variables that are similar in each data set, but their values are going to differ slightly (misspellings, abbreviations, etc.). Does anyone have any recommendations for how to go about this? First off, what program would I want to do this in? I have experience in Stata, R, and a little bit of SAS and MATLAB, but I'd prefer to use the former two.

I read through this:

Data matching algorithm

There the asker uses .NET, and one user suggested a Levenshtein approach (where the edit distance between two strings is calculated), so for fields like Address I could use this and weight the approximate match between the two strings. It was also suggested to use Soundex for the name of the seller/owner.

But I'm really lost in how to implement any of this, and before I approach anyone in my department I really need to have some sort of idea of what I'm doing!

Any help or advice would be immensely helpful.

Yes, there are several good algorithms for the string matching problem you describe, namely:

  • Jaro-Winkler,
  • Smith-Waterman,
  • Sørensen-Dice,
  • Soundex,
  • Damerau-Levenshtein, and
  • Monge-Elkan, to name a few.
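
To give a feel for what an edit-distance metric computes, here is a minimal Python sketch of the optimal string alignment distance, a restricted form of Damerau-Levenshtein that counts insertions, deletions, substitutions, and swaps of adjacent characters. The function names `osa_distance` and `similarity` are made up for this illustration, and dividing by the longer string's length is only one of several ways to normalize the distance.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i chars of a and the first j chars of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # swap of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - osa_distance(a, b) / longest
```

A score near 1.0 for two address strings suggests the same address with a typo; where to put the cutoff is something you would have to tune on your own data. In R, the `stringdist` package provides this family of metrics ready-made.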

I recommend A Comparison of String Distance Metrics for Name-Matching Tasks, by W. W. Cohen, P. Ravikumar, and S. Fienberg, for an overview of which metric tends to work best for which task.

SoftTFIDF is reported to be the best one. It is available as a Java package. Other implementations of string matching and record linkage algorithms are available as libraries.
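
As a rough sketch of how such metrics could be combined for the property data in the question, the Python below scores a pair of records with a weighted sum of an address similarity and a Soundex comparison of the owner name. Everything specific here is an assumption made for illustration: the field names `address` and `owner`, the abbreviation table, the weights, and the threshold. `difflib.SequenceMatcher` from the standard library stands in for a proper edit-distance similarity.

```python
from difflib import SequenceMatcher

# Illustrative abbreviation table; a real one would be much longer.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road", "dr": "drive"}

# American Soundex digit for each coded consonant.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Four-character American Soundex code, or "" if name has no letters."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    out = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate repeated codes
        code = CODES.get(c, "")
        if code and code != prev:
            out += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (out + "000")[:4]


def normalize_address(addr: str) -> str:
    """Lowercase, drop punctuation, and expand known abbreviations."""
    tokens = addr.lower().replace(".", " ").replace(",", " ").split()
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)


def match_score(rec_a: dict, rec_b: dict, w_address=0.7, w_name=0.3) -> float:
    """Weighted similarity in [0, 1]; the weights are arbitrary starting points."""
    addr_sim = SequenceMatcher(
        None, normalize_address(rec_a["address"]), normalize_address(rec_b["address"])
    ).ratio()
    code_a, code_b = soundex(rec_a["owner"]), soundex(rec_b["owner"])
    name_sim = 1.0 if code_a and code_a == code_b else 0.0
    return w_address * addr_sim + w_name * name_sim


def best_match(record: dict, candidates: list, threshold=0.85):
    """Return (candidate, score) for the top-scoring candidate, or None."""
    scored = [(cand, match_score(record, cand)) for cand in candidates]
    if not scored:
        return None
    cand, score = max(scored, key=lambda pair: pair[1])
    return (cand, score) if score >= threshold else None
```

With 2 million rows on each side, scoring every pair is not practical, so you would normally compare only records that already agree on some exact key (ZIP code, for instance) and run the fuzzy scoring within those groups.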
