
String comparison in Python, but not Levenshtein distance (I think)

I found a crude string comparison in a paper I am reading. It is done as follows.

The equation they use is given below (extracted from the paper, with small word changes to make it more general and readable). Since the authors' description is not very clear, I have first tried to explain it in my own words, using an example by the authors.

For example, for the two sequences ABCDE and BCEFA there are two possible graphs:

graph 1) connects B with B, C with C and E with E

graph 2) connects A with A

I cannot connect A with A while also connecting the other three (graph 1), since that would produce crossing lines: if you draw lines between BB, CC and EE, the line linking AA will cross all of them. So these two sequences result in two possible graphs; one has three connections (BB, CC and EE) and the other only one (AA). I then calculate the score d as given by the equation below.
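To make the "no crossing lines" rule concrete, here is a minimal Python sketch that lists the maximal non-crossing graphs for the example above. The function names are hypothetical, and restricting the output to maximal graphs (ones that cannot take another link) is an assumption made to match the two graphs described here; the paper's text treats any part of the graph as a configuration. The brute-force enumeration is only practical for short strings such as penta-strings.

```python
from itertools import combinations


def identity_links(s, t):
    """All (i, j) index pairs where s[i] and t[j] are the same character."""
    return [(i, j) for i, a in enumerate(s) for j, b in enumerate(t) if a == b]


def is_non_crossing(links):
    """True if no two links cross or share an endpoint."""
    ordered = sorted(links)
    return all(i1 < i2 and j1 < j2
               for (i1, j1), (i2, j2) in zip(ordered, ordered[1:]))


def maximal_configurations(s, t):
    """Non-crossing link sets that cannot be extended by another link."""
    links = identity_links(s, t)
    valid = [frozenset(c)
             for r in range(1, len(links) + 1)
             for c in combinations(links, r)
             if is_non_crossing(c)]
    # Keep only sets that are not a proper subset of another valid set.
    return [c for c in valid if not any(c < other for other in valid)]


for config in maximal_configurations("ABCDE", "BCEFA"):
    print(sorted("ABCDE"[i] for i, _ in config))
# ['A']
# ['B', 'C', 'E']
```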

Consequently, to define the degree of similarity between two penta-strings we calculate the distance d between them. Aligning the two penta-strings, we look for all the identities between their characters, wherever these may be located. If each identity is represented by a link between both penta-strings, we define a graph for this pair. We call any part of this graph a configuration.

Next, we retain all of those configurations in which there is no character cross pairing (the meaning is explained in my example above, i.e. links between identical characters must not cross, and only those graphs are retained). Each of these is then evaluated as a function of the number p of characters related to the graph, the shifting Δi for the corresponding pairs and the gap δij between connected characters of each penta-string. The minimum value is chosen as characteristic and is called distance d:

d = Min(50 – 10p + ΣΔi + Σδij)

Although very rough, this measure is generally in good agreement with the qualitative eye-guided estimation. For instance, the distance between abcde and abcfg is 20, whereas that between abcde and abfcg is 23 = (50 – 30 + 1 + 2).
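The arithmetic of the two quoted examples can be checked with a small sketch. The function name is hypothetical, and it only evaluates the formula for one configuration whose shift and gap values are supplied by hand; it does not decide how Δi and δij should be measured, since the quoted text does not define them precisely. Assigning 1 to the shift term and 2 to the gap term for the second example simply follows the order of the terms in the formula and is an assumption.

```python
def configuration_score(p, shifts, gaps):
    """Evaluate 50 - 10p + sum(shifts) + sum(gaps) for one configuration."""
    return 50 - 10 * p + sum(shifts) + sum(gaps)


# abcde vs abcfg: a, b and c are linked (p = 3). The quoted result of 20
# implies that the shift and gap terms sum to zero here.
print(configuration_score(3, [], []))    # 20

# abcde vs abfcg: the paper's own arithmetic, 50 - 30 + 1 + 2.
# Which of 1 and 2 is the shift and which the gap is an assumption.
print(configuration_score(3, [1], [2]))  # 23
```

The distance d would then be the minimum of this score over all retained (non-crossing) configurations.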

I am confused as to how to go about doing this. Any suggestions to help me would be much appreciated.

I tried Levenshtein distance and also simple sequence alignment, as used in protein sequence comparison. The link to the paper is: http://peds.oxfordjournals.org/content/16/2/103.long

I could not find any information on the first author, Alain Figureau, and my emails to MA Soto have not been answered (as of today).

Thank you

Well, it's definitely not Levenshtein:

>>> from nltk import metrics
>>> metrics.distance.edit_distance('abcde','abcfg')
2
>>> metrics.distance.edit_distance('abcde','abfcg')
3
>>> help(metrics.distance.edit_distance)
Help on function edit_distance in module nltk.metrics.distance:

edit_distance(s1, s2)
    Calculate the Levenshtein edit-distance between two strings.
    The edit distance is the number of characters that need to be
    substituted, inserted, or deleted, to transform s1 into s2.  For
    example, transforming "rain" to "shine" requires three steps,
    consisting of two substitutions and one insertion:
    "rain" -> "sain" -> "shin" -> "shine".  These operations could have
    been done in other orders, but at least three steps are needed.

    @param s1, s2: The strings to be analysed
    @type s1: C{string}
    @type s2: C{string}
    @rtype C{int}
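
For anyone without NLTK installed, the same check can be sketched with a standalone function. This is a plain unit-cost dynamic-programming edit distance written for illustration (the name `levenshtein` is not an NLTK API); it gives 2 and 3 for the two pairs, not the 20 and 23 quoted from the paper.

```python
def levenshtein(s, t):
    """Unit-cost edit distance between s and t, computed row by row."""
    prev = list(range(len(t) + 1))  # distances from "" to each prefix of t
    for i, a in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to ""
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,               # delete a
                            curr[j - 1] + 1,           # insert b
                            prev[j - 1] + (a != b)))   # substitute or match
        prev = curr
    return prev[-1]


print(levenshtein('abcde', 'abcfg'))  # 2
print(levenshtein('abcde', 'abfcg'))  # 3
```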

Just after the text block you cite, there is a reference to a previous paper by the same authors: Secondary Structure of Proteins and Three-dimensional Pattern Recognition. I think it is worth looking into if the distance is not explained anywhere else (I'm not at work, so I don't have access to the full document).

Otherwise, you can also try to contact the authors directly. Alain Figureau seems to be an old-school French researcher with no contact details whatsoever (no webpage, no e-mail, no "social networking", ...), so I advise contacting MA Soto, whose e-mail is given at the end of the paper. I think they will give you the answer you're looking for: an experimental procedure has to be crystal clear in order to be repeatable; that is part of the scientific method.
