Threshold value in one-dimensional data

I have a list of similarity scores, similarity_scores, between pairs of texts, computed with a string matching method. I manually added actual_value to indicate whether the texts were in fact similar. Is there a statistical way to find a threshold value for the similarity score?

similarity_scores   actual_value
1.0 1
1.0 1
1.0 1
1.0 1
0.99    1
0.99    1
0.99    1
0.989   1
0.944   1
0.944   1
0.941   1
0.941   1
0.941   1
0.941   1
0.941   0
0.934   0
0.933   0
0.933   1
0.88    1
0.784   0
0.727   0
0.727   0
0.714   0
0.714   1
0.714   0
0.714   0
0.711   0
0.711   0
0.707   0
0.707   0
0.696   0
0.696   0
0.696   0
0.696   0

A common way to judge how good a particular classification is for document retrieval is to use precision and recall. In your example, for a given threshold [1]:

Precision tells you what percentage of the documents above the threshold were manually tagged with a 1:

number of documents above the threshold tagged 1
------------------------------------------------
    number of documents above the threshold

Recall tells you what percentage of the documents tagged with a 1 were above the threshold:

number of documents above the threshold tagged 1
------------------------------------------------
         number of documents tagged 1
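
As a minimal sketch of the two definitions above (not part of the original answer), the following Python computes both values for a single threshold. The function name precision_recall and the convention of returning a precision of 1.0 when nothing lies above the threshold are assumptions made for illustration. A score counts as above the threshold when it is greater than or equal to it, as defined in [1].

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when every score >= threshold is treated as a match."""
    # Labels of the documents that fall at or above the threshold.
    above = [lab for s, lab in zip(scores, labels) if s >= threshold]
    ones_above = sum(above)
    total_ones = sum(labels)
    # Assumed convention: precision is 1.0 when no document is above the threshold.
    precision = ones_above / len(above) if above else 1.0
    recall = ones_above / total_ones if total_ones else 0.0
    return precision, recall


# The 34 rows from the question, in the same order.
scores = ([1.0] * 4 + [0.99] * 3 + [0.989] + [0.944] * 2 + [0.941] * 5
          + [0.934, 0.933, 0.933, 0.88, 0.784, 0.727, 0.727]
          + [0.714] * 4 + [0.711] * 2 + [0.707] * 2 + [0.696] * 4)
labels = [1] * 14 + [0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0] + [0] * 8

p, r = precision_recall(scores, labels, 0.88)
print(f"precision={p:.3f} recall={r:.3f}")  # precision=0.842 recall=0.941
```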

In the example you gave, you can compute these values for every possible threshold, but the only thresholds worth examining are those where the tags switch between runs of ones and zeros, so I'll look only at those points:

1.0 1
1.0 1
1.0 1
1.0 1
0.99    1
0.99    1
0.99    1
0.989   1
0.944   1
0.944   1 TH=0.944 #1's=10; #0's=0
0.941   1
0.941   1
0.941   1
0.941   1
0.941   0 TH=0.941 #1's=14; #0's=1
0.934   0
0.933   0
0.933   1 TH=0.933 #1's=15; #0's=3
0.88    1 TH=0.880 #1's=16; #0's=3
0.784   0
0.727   0
0.727   0
0.714   0
0.714   1
0.714   0
0.714   0 TH=0.714 #1's=17; #0's=9
0.711   0
0.711   0
0.707   0
0.707   0
0.696   0
0.696   0
0.696   0
0.696   0

The total number of documents tagged 1 is 17.

Therefore, for these 5 candidate thresholds TH, precision and recall are as follows:

TH = 0.944
    precision = 10/10       = 1.000
    recall = 10/17          = 0.588
TH = 0.941
    precision = 14/15       = 0.933
    recall = 14/17          = 0.824
TH = 0.933
    precision = 15/18       = 0.833
    recall = 15/17          = 0.882
TH = 0.880
    precision = 16/19       = 0.842
    recall = 16/17          = 0.941
TH = 0.714
    precision = 17/26       = 0.654
    recall = 17/17          = 1.000
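
The counts and ratios above can be reproduced with one pass over the data sorted by score. The sketch below is an illustration rather than the answerer's own code: the helper name sweep_thresholds is hypothetical, and it reports every distinct score as a threshold instead of only the five transition points, so the five rows above appear as a subset of its output. It assumes at least one document is tagged 1.

```python
from itertools import groupby


def sweep_thresholds(scores, labels):
    """Return (threshold, ones_above, zeros_above, precision, recall) per distinct score."""
    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    total_ones = sum(labels)  # assumed to be non-zero
    rows, ones, zeros = [], 0, 0
    for th, group in groupby(pairs, key=lambda p: p[0]):
        # Add every document whose score equals th to the running counts,
        # so the counts cover all documents with score >= th.
        for _, lab in group:
            ones += lab
            zeros += 1 - lab
        rows.append((th, ones, zeros, ones / (ones + zeros), ones / total_ones))
    return rows


# The 34 rows from the question, in the same order.
scores = ([1.0] * 4 + [0.99] * 3 + [0.989] + [0.944] * 2 + [0.941] * 5
          + [0.934, 0.933, 0.933, 0.88, 0.784, 0.727, 0.727]
          + [0.714] * 4 + [0.711] * 2 + [0.707] * 2 + [0.696] * 4)
labels = [1] * 14 + [0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0] + [0] * 8

rows = sweep_thresholds(scores, labels)
for th, ones, zeros, precision, recall in rows:
    print(f"TH={th:.3f}  #1's={ones:2d}  #0's={zeros:2d}  "
          f"precision={precision:.3f}  recall={recall:.3f}")
```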

What you do with these values from here depends a great deal on your data and on how sensitive the results should be to false negatives or false positives. For instance, if you want as few false positives as possible, you would go with a threshold of TH = 0.941 or even TH = 0.944.

If you want to balance precision and recall, you might go with TH = 0.880, because both measures are higher than at the threshold above it (TH = 0.933) and precision is much better than at the threshold below it (TH = 0.714). This is a rather subjective way of choosing, but it can be automated to an extent with an F-measure. Here I'll use the F1-measure, but you can choose a variant that suits your data.

The F1-measure is defined as:

F1 = 2 * precision * recall
         ------------------
         precision + recall

Using the numbers above we get:

TH = 0.944   F1 = 2*1.000*0.588/(1.000+0.588) = 0.741
TH = 0.941   F1 = 2*0.933*0.824/(0.933+0.824) = 0.875
TH = 0.933   F1 = 2*0.833*0.882/(0.833+0.882) = 0.857
TH = 0.880   F1 = 2*0.842*0.941/(0.842+0.941) = 0.889
TH = 0.714   F1 = 2*0.654*1.000/(0.654+1.000) = 0.791

As you can see, by the F1 measure TH = 0.880 comes out on top, with TH = 0.941 not far behind, which agrees closely with the manual inspection of the candidate thresholds.
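
A short sketch of automating that choice, assuming the counts of ones and zeros at each candidate threshold listed earlier. The function f_beta and its beta parameter are an addition for illustration: beta = 1 gives the F1-measure used here, and other values give the general F-beta family, which is one possible reading of "an F-measure that suits your data".

```python
def f_beta(precision, recall, beta=1.0):
    """General F-measure; beta=1 is F1, beta<1 favours precision, beta>1 favours recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


TOTAL_ONES = 17  # documents tagged 1 in the whole list

# threshold -> (#1's at or above it, #0's at or above it), taken from the answer
counts = {0.944: (10, 0), 0.941: (14, 1), 0.933: (15, 3),
          0.880: (16, 3), 0.714: (17, 9)}


def score_thresholds(beta=1.0):
    """Map each candidate threshold to its F-beta value."""
    results = {}
    for th, (ones, zeros) in counts.items():
        precision = ones / (ones + zeros)
        recall = ones / TOTAL_ONES
        results[th] = f_beta(precision, recall, beta)
    return results


f1 = score_thresholds()
for th, value in f1.items():
    print(f"TH = {th:.3f}   F1 = {value:.3f}")

best = max(f1, key=f1.get)
print("best threshold by F1:", best)  # best threshold by F1: 0.88
```

Calling score_thresholds(beta=0.5), which weights precision more heavily, moves the top spot to TH = 0.941, in line with the earlier remark about minimizing false positives.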

[1] To clarify, I define the threshold such that similarity scores greater than or equal to it are considered above the threshold, and scores strictly less than it are considered below.
