
Most appropriate analysis method - Clustering?

I have 2 large data frames with similar variables, representing 2 separate surveys. Some rows (participants) in one data frame correspond to rows in the other, and I would like to link them.

There is an index in both data frames, but it indicates the locality of the survey (i.e. region), not individual IDs. Merging on it is not possible because, in most cases, different participants share identical index values.

Given that merging on the index alone is not possible, I wish to compare similar (binary) variables from both data frames, in addition to the index values common to both, to find the most likely match for each row. I can then (with some margin of error) pair the rows whose values on those variables are most similar and merge them.

What do you think would be the appropriate method for doing this? Clustering?

Best, James

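This is not a clustering problem. Clustering produces large groups of records, and you don't want that: you want to pair individual rows.

What you want to do is an approximate JOIN (also known as fuzzy matching or record linkage).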
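One way such an approximate join could look in pandas is sketched below. It is only an illustration under stated assumptions, not a definitive method: the column names (`region`, `q1`, `q2`, `q3`), the helper `approximate_join`, and the `max_mismatches` threshold are all hypothetical. Rows are compared only within the same region, similarity is measured as the number of binary variables that differ (Hamming distance), and each row of the first survey is paired with its closest row in the second if the distance is within the threshold.

```python
import numpy as np
import pandas as pd


def approximate_join(left, right, block_col, compare_cols, max_mismatches=1):
    """Pair each row of `left` with its most similar row of `right` in the same block.

    Hypothetical helper: similarity is the count of differing binary variables.
    """
    matches = []
    for block, left_block in left.groupby(block_col):
        # Only compare participants that share the same region index.
        right_block = right[right[block_col] == block]
        if right_block.empty:
            continue
        a = left_block[compare_cols].to_numpy()
        b = right_block[compare_cols].to_numpy()
        # Hamming distance for every (left row, right row) pair in this block.
        dist = (a[:, None, :] != b[None, :, :]).sum(axis=2)
        best = dist.argmin(axis=1)
        best_dist = dist[np.arange(len(a)), best]
        for i, (j, d) in enumerate(zip(best, best_dist)):
            if d <= max_mismatches:
                matches.append({
                    "left_row": left_block.index[i],
                    "right_row": right_block.index[j],
                    "mismatches": int(d),
                })
    return pd.DataFrame(matches, columns=["left_row", "right_row", "mismatches"])


# Made-up example data: two tiny surveys sharing a region index.
survey_a = pd.DataFrame(
    {"region": [1, 1, 2], "q1": [1, 0, 1], "q2": [0, 0, 1], "q3": [1, 1, 0]}
)
survey_b = pd.DataFrame(
    {"region": [1, 1, 2], "q1": [0, 1, 0], "q2": [0, 0, 0], "q3": [1, 1, 1]}
)

matches = approximate_join(survey_a, survey_b, "region", ["q1", "q2", "q3"])
print(matches)
```

In this toy data, row 0 of `survey_a` pairs with row 1 of `survey_b` and row 1 pairs with row 0, both with zero mismatches, while the region-2 rows differ on all three variables and are left unmatched. Note that this greedy sketch can assign the same right-hand row to several left-hand rows; if a strict one-to-one pairing is needed, the distance matrix for each region could instead be passed to an assignment solver such as `scipy.optimize.linear_sum_assignment`. For large regions the pairwise distance matrix may also become too big to hold in memory.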
