
Comparing the similarity of nested lists

I have a list of three entries. Each entry is itself a list that holds an ID and a nested list of characters.

data_set = [
    ['AB12345',['T','T','C','C','A','C','A','G','C','T','T','T','T','C']],
    ['AB12346',['T','T','C','C','A','C','C','G','C','T','C','T','T','C']],
    ['AB12347',['T','G','C','C','A','C','G','G','C','T','T','C','T','C']]
]

I have a compare function that returns the similarity, as a percentage, between two of the character lists. The IDs are not compared.

def compare(_from, _to):
    similarity = 0
    length = len(_from)
    if len(_from) != len(_to):
        raise Exception("Cannot be compared due to different length.")
    for i in range(length):
        if _from[i] == _to[i]:
            similarity += 1
    return similarity / length * 100

compare(data_set[0][1], data_set[1][1])

Using compare, I wrote a for loop that compares the first list ("a") with every list, itself included: "a" to "a", "a" to "b", and "a" to "c".

for i in range(len(data_set)):
    data_set[i].append(compare(data_set[0][1], data_set[i][1]))
    print(round(data_set[i][2], 2), end=", ")

After comparing the first list with itself and the other lists, how do I move on to the second and third lists and compare each of them with every list as well? That is, "b" to "a", "b" to "b", and "b" to "c", then "c" to "a", "c" to "b", and "c" to "c".

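Just use a second, nested loop like this:

for i in range(len(data_set)):
    for j in range(len(data_set)):
        # Compare entry i against every entry j and store the score on entry i.
        data_set[i].append(compare(data_set[j][1], data_set[i][1]))
        # Print the score that was just appended; index 2 would always be the first score.
        print(round(data_set[i][-1], 2), end=", ")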

For future reference, it's better to include your input lists (a, b, c) as text in the question rather than as a screenshot, so that people don't have to type them out. I used some shorter versions for testing.

You could do something like the following to loop over the data twice and compare every pair. Iterating over the entries directly is neater than using for i in range(len(data_set)):

# Make some test data
a= ["ID_A", ['T', 'G', 'A']]
b= ["ID_B", ['T', 'C', 'A']]
c= ["ID_C", ['C', 'A', 'A']]

data = [a,b,c]

# entry1 takes each of the values a,b,c in order, and entry2 will do the same,
# so you'll have all possible combinations.
for entry1 in data:
    for entry2 in data:
        score = compare(entry1[1], entry2[1])
        print("Compare ", entry1[0], " to ", entry2[0], "Score :", round(score))

Output:

Compare  ID_A  to  ID_A  Score : 100
Compare  ID_A  to  ID_B  Score : 67
Compare  ID_A  to  ID_C  Score : 33
Compare  ID_B  to  ID_A  Score : 67
Compare  ID_B  to  ID_B  Score : 100
Compare  ID_B  to  ID_C  Score : 33
Compare  ID_C  to  ID_A  Score : 33
Compare  ID_C  to  ID_B  Score : 33
Compare  ID_C  to  ID_C  Score : 100

You're probably better off storing the scores in a separate structure rather than appending them to the lists that hold your data.
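As a rough sketch of that idea, the scores could go into a dictionary keyed by pairs of IDs, which leaves the input lists untouched. The name scores and the choice of a dictionary are assumptions made for illustration, and compare is rewritten here with zip only so that the snippet runs on its own.

```python
def compare(seq_a, seq_b):
    """Percentage of positions at which two equal-length sequences match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("Cannot be compared due to different length.")
    matches = sum(1 for x, y in zip(seq_a, seq_b) if x == y)
    return matches / len(seq_a) * 100

# Same shortened test data as above.
data = [
    ["ID_A", ['T', 'G', 'A']],
    ["ID_B", ['T', 'C', 'A']],
    ["ID_C", ['C', 'A', 'A']],
]

# scores is a hypothetical name: a dict keyed by (from_id, to_id),
# kept apart from data so the input is never modified.
scores = {}
for id1, seq1 in data:
    for id2, seq2 in data:
        scores[(id1, id2)] = compare(seq1, seq2)

# Look up any pair by its IDs.
print(round(scores[("ID_A", "ID_B")]))
```

Looking up a score by ID pair also avoids having to remember which list index holds which comparison.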

You could also use itertools.combinations to compare all of your sublists. Also, in your compare() function you may want to consider returning a value that indicates sublists are not comparable rather than raising an exception so that you don't short circuit your loop prematurely when comparing a larger set of sublists.

The following example uses a slightly simpler version of your compare() function that returns -1 when the lists cannot be compared because their lengths differ. Because itertools.combinations never pairs a list with itself, the self-comparisons are skipped. Their result would always be 100, so computing them would be wasted work.

import itertools

data_set = [
    ['AB12345',['T','T','C','C','A','C','A','G','C','T','T','T','T','C']],
    ['AB12346',['T','T','C','C','A','C','C','G','C','T','C','T','T','C']],
    ['AB12347',['T','G','C','C','A','C','G','G','C','T','T','C','T','C']]
    ]

def compare(a, b):
    length = len(a) if len(a) == len(b) else 0
    similarity = sum(1 for i in range(length) if a[i] == b[i])
    return similarity / length * 100 if length else -1

for a, b in itertools.combinations(data_set, 2):
    compared = a[0] + ' and ' + b[0]
    result = compare(a[1], b[1])
    print(f'{compared}: {result}')

# OUTPUT
# AB12345 and AB12346: 85.71428571428571
# AB12345 and AB12347: 78.57142857142857
# AB12346 and AB12347: 71.42857142857143
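If the full "every list against every list" table from the question is still wanted, one possible approach is to compute each unordered pair once with itertools.combinations and mirror the result, since compare(a, b) equals compare(b, a). This is only a sketch: the name table is hypothetical, and filling the diagonal with 100.0 assumes every sequence is non-empty.

```python
import itertools

def compare(a, b):
    # Same behaviour as the version above: -1 means "not comparable".
    length = len(a) if len(a) == len(b) else 0
    similarity = sum(1 for i in range(length) if a[i] == b[i])
    return similarity / length * 100 if length else -1

data_set = [
    ['AB12345', ['T','T','C','C','A','C','A','G','C','T','T','T','T','C']],
    ['AB12346', ['T','T','C','C','A','C','C','G','C','T','C','T','T','C']],
    ['AB12347', ['T','G','C','C','A','C','G','G','C','T','T','C','T','C']],
]

ids = [entry[0] for entry in data_set]

# Assumption: a non-empty sequence always matches itself completely,
# so the diagonal is filled in without calling compare().
table = {(i, i): 100.0 for i in ids}

# Each unordered pair is compared once and stored in both directions.
for (id_a, seq_a), (id_b, seq_b) in itertools.combinations(data_set, 2):
    score = compare(seq_a, seq_b)
    table[(id_a, id_b)] = score
    table[(id_b, id_a)] = score

# Print one row per ID, with columns in the same ID order.
for row_id in ids:
    print(row_id, [round(table[(row_id, col_id)], 2) for col_id in ids])
```

For n lists this makes n * (n - 1) / 2 calls to compare() instead of n * n, which starts to matter as the data set grows.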
