简体   繁体   中英

Finding out sequence similarity in arrays

I have a task where I have three arrays A,B,C. All of the contain the same data. For the sake of simplicity lets assume the data is numbers 1 to 5. The data would be in different jumbled sequences. I want to find out among B & C which array has data most similar to A.

Eg: 
A = 1,2,3,4,5
B = 1,2,3,5,4
C = 4,1,2,3,5

In this case, it is easy to visually comprehend that B is more similar to A. But it gets more complicated for really jumbled sequences.

Eg: 
A = 1,2,3,4,5
B = 5,3,1,4,2
C = 4,1,2,3,5

In this case, I would assume C to be more closer to A. I am thinking that this assumption can be quantified as: How many elements have the same sequence in both arrays? In above example the subsequence of [1,2,3] is the same in both arrays. The second question would be what is the offset difference between the similar subsequence ? In this case it is 1, because the subsequence begins at index 0 for A and index 1 for C.

So the number of elements in a matching sequence and their offsets are what I am thinking to use. I plan on adding a weightage to these two entities (number of elements in matching sequence, and offset difference in their occurrence)

Does this make sense? I only need a rough approximation of similarity and the results do not need to be exact. Are there any formal mathematical or data-structure models that solve this problem?

BTW, the project where I need this implemented is in PHP. Does it have any inbuilt functions like the levenstein model for string difference?

Any suggestions are very welcome!

Well I suppose you can come up with your own algorithm (for instance generate all suffixes and then search for them and then define a scoring procedure) or you could use a well known algorithm like
Smith-Waterman for local alignment or Needleman-Wunsch for global. The advantage of these algorithms is that they are well-understood and give you all the possible alignments (and you can choose the best for your case).

NW in PHP

SW in PHP

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM