I have a list of similarity scores (similarity_scores) between pairs of texts, computed with a string matching method. I manually added an actual_value column indicating whether the texts were indeed similar. Is there a statistical way to find a threshold value over the similarity score?
similarity_scores actual_value
1.0 1
1.0 1
1.0 1
1.0 1
0.99 1
0.99 1
0.99 1
0.989 1
0.944 1
0.944 1
0.941 1
0.941 1
0.941 1
0.941 1
0.941 0
0.934 0
0.933 0
0.933 1
0.88 1
0.784 0
0.727 0
0.727 0
0.714 0
0.714 1
0.714 0
0.714 0
0.711 0
0.711 0
0.707 0
0.707 0
0.696 0
0.696 0
0.696 0
0.696 0
A common way of determining how good a particular classification is for document retrieval is to use the precision and recall values. In your example, for a given threshold [1]:
Precision tells you what percentage of the documents above the threshold were manually tagged with a 1 value:
number of documents above the threshold tagged 1
------------------------------------------------
number of documents above the threshold
Recall tells you what percentage of the documents tagged 1 were above the threshold:
number of documents above the threshold tagged 1
------------------------------------------------
number of documents tagged 1
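The two definitions above can be sketched directly, assuming the scores and manual labels are kept in two parallel Python lists (the data below is copied from the question):

```python
# Scores and manual labels from the question, in the same order as listed.
scores = [1.0, 1.0, 1.0, 1.0, 0.99, 0.99, 0.99, 0.989, 0.944, 0.944,
          0.941, 0.941, 0.941, 0.941, 0.941, 0.934, 0.933, 0.933, 0.88, 0.784,
          0.727, 0.727, 0.714, 0.714, 0.714, 0.714, 0.711, 0.711, 0.707, 0.707,
          0.696, 0.696, 0.696, 0.696]
labels = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 0, 0, 0, 1, 1, 0,
          0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0]

def precision_recall(scores, labels, threshold):
    """Precision and recall when every score >= threshold is predicted 1."""
    # Labels of all documents at or above the threshold.
    above = [l for s, l in zip(scores, labels) if s >= threshold]
    true_pos = sum(above)          # documents above the threshold tagged 1
    precision = true_pos / len(above)
    recall = true_pos / sum(labels)  # denominator: all documents tagged 1
    return precision, recall
```

For example, `precision_recall(scores, labels, 0.944)` gives a precision of 1.0 (all 10 documents at or above 0.944 are tagged 1) and a recall of 10/17.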
In the example you gave, you can compute these values for each possible threshold, but the only relevant ones are those in which we have transitions between sequences of zeros and ones, so I'll only look at those points:
1.0 1
1.0 1
1.0 1
1.0 1
0.99 1
0.99 1
0.99 1
0.989 1
0.944 1
0.944 1 TH=0.944 #1's=10; #0's=0
0.941 1
0.941 1
0.941 1
0.941 1
0.941 0 TH=0.941 #1's=14; #0's=1
0.934 0
0.933 0
0.933 1 TH=0.933 #1's=15; #0's=3
0.88 1 TH=0.880 #1's=16; #0's=3
0.784 0
0.727 0
0.727 0
0.714 0
0.714 1
0.714 0
0.714 0 TH=0.714 #1's=17; #0's=9
0.711 0
0.711 0
0.707 0
0.707 0
0.696 0
0.696 0
0.696 0
0.696 0
And the total number of documents tagged 1 is 17.
Therefore, for these 5 possible thresholds TH, we have precision and recall as follows:
TH = 0.944
precision = 10/10 = 1.000
recall = 10/17 = 0.588
TH = 0.941
precision = 14/15 = 0.933
recall = 14/17 = 0.824
TH = 0.933
precision = 15/18 = 0.833
recall = 15/17 = 0.882
TH = 0.880
precision = 16/19 = 0.842
recall = 16/17 = 0.941
TH = 0.714
precision = 17/26 = 0.654
recall = 17/17 = 1.000
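The sweep over the five transition thresholds can be reproduced with a short loop, assuming the same parallel-list layout for the data:

```python
# Scores and manual labels from the question, in the same order as listed.
scores = [1.0, 1.0, 1.0, 1.0, 0.99, 0.99, 0.99, 0.989, 0.944, 0.944,
          0.941, 0.941, 0.941, 0.941, 0.941, 0.934, 0.933, 0.933, 0.88, 0.784,
          0.727, 0.727, 0.714, 0.714, 0.714, 0.714, 0.711, 0.711, 0.707, 0.707,
          0.696, 0.696, 0.696, 0.696]
labels = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 0, 0, 0, 1, 1, 0,
          0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0]

# The candidate thresholds: the transition points identified above.
for th in [0.944, 0.941, 0.933, 0.880, 0.714]:
    above = [l for s, l in zip(scores, labels) if s >= th]
    tp = sum(above)  # documents above the threshold tagged 1
    print(f"TH = {th:.3f}  precision = {tp}/{len(above)} = {tp / len(above):.3f}"
          f"  recall = {tp}/{sum(labels)} = {tp / sum(labels):.3f}")
```

If you are already using scikit-learn, its `precision_recall_curve` function computes precision and recall at every distinct score in one call, which avoids enumerating the transition points by hand.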
What you do with these values from here depends a great deal on your data and on how sensitive the results should be to false negatives or false positives. For instance, if you want as few false positives as possible, you would go with a threshold of TH = 0.941 or even TH = 0.944.
If you want to balance precision and recall, you might go with TH = 0.880, because both measures improve over the threshold above it and precision is much better than at the threshold below it. This is a rather subjective way of choosing, but we can automate it to an extent by using an F-measure. In particular, I'll use the F1-measure, but you can find one that suits your data.
The F1-measure is defined as:
     2 * precision * recall
F1 = ----------------------
       precision + recall
Using the numbers above we get:
TH = 0.944   F1 = 2*1.000*0.588/(1.000+0.588) = 0.741
TH = 0.941   F1 = 2*0.933*0.824/(0.933+0.824) = 0.875
TH = 0.933   F1 = 2*0.833*0.882/(0.833+0.882) = 0.857
TH = 0.880   F1 = 2*0.842*0.941/(0.842+0.941) = 0.889
TH = 0.714   F1 = 2*0.654*1.000/(0.654+1.000) = 0.791
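The whole selection can be automated: compute F1 at each candidate threshold and take the maximum. A minimal sketch, again assuming the data sits in two parallel lists:

```python
# Scores and manual labels from the question, in the same order as listed.
scores = [1.0, 1.0, 1.0, 1.0, 0.99, 0.99, 0.99, 0.989, 0.944, 0.944,
          0.941, 0.941, 0.941, 0.941, 0.941, 0.934, 0.933, 0.933, 0.88, 0.784,
          0.727, 0.727, 0.714, 0.714, 0.714, 0.714, 0.711, 0.711, 0.707, 0.707,
          0.696, 0.696, 0.696, 0.696]
labels = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 0, 0, 0, 1, 1, 0,
          0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0]

def f1_at(th):
    """F1 = 2*p*r/(p+r) when every score >= th is predicted 1."""
    above = [l for s, l in zip(scores, labels) if s >= th]
    p = sum(above) / len(above)   # precision
    r = sum(above) / sum(labels)  # recall
    return 2 * p * r / (p + r)

candidates = [0.944, 0.941, 0.933, 0.880, 0.714]
best = max(candidates, key=f1_at)  # the threshold with the highest F1
```

Here `best` comes out as 0.880, matching the table above. scikit-learn's `f1_score` does the same per-threshold computation if you first binarize the scores against each candidate threshold.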
As you can see, by the F1 measure, TH = 0.880 comes out on top, with TH = 0.941 not too far behind, giving very similar results to the manual inspection of the possible thresholds.
[1] To clarify, I define the threshold such that similarity scores greater than or equal to the threshold are considered above the threshold, and similarity scores strictly less than the threshold are considered below it.