Using fairseq-generate.py with the transformer architecture, each translation produces a section like this:
Why is it rare to discover new marine mammal species?
S-0 Why is it rare to discover new marine mam@@ mal species ?
H-0 -0.0643349438905716 Pourquoi est-il rare de découvrir de nouvelles espèces de mammifères marins?
P-0 -0.0763 -0.1849 -0.0956 -0.0946 -0.0735 -0.1150 -0.1301 -0.0042 -0.0321 -0.0171 -0.0052 -0.0062 -0.0015
With this explanation:
H is the hypothesis along with an average log-likelihood, and P is the positional score per token position, including the end-of-sentence marker.
I'm wondering if it is reasonable to say that a low (absolute) number in the P row means higher confidence in that particular word. E.g. does -0.07 for "Pourquoi" mean it was happier about that than it was (-0.1849) about "est-il"? And does the low -0.0015 at the end mean it was really confident the sentence should end there?
Background: What I'm trying to work out is whether I can use either the H number, or somehow the individual P numbers, to get a confidence measure for the translation. I've been analyzing a handful of translations against the H number and didn't notice much correspondence between it and my subjective opinion of translation quality. But I have a couple where I thought the translation was particularly poor (it had missed a bit of key information), and the final P number was a relatively high -0.6099 and -0.3091. (The final P number is -0.11 or so on most of them.)
Q: I'm wondering if it is reasonable to say a low (absolute) number in the P row means higher confidence in that particular word?
Yes. As the docs say, "P is the positional score per token position". The score is actually the log probability, so the higher it is (i.e. the lower its absolute value), the more "confident" the model is. The source code may not be that easy to follow, but the scores are generated by the SequenceScorer, and there you can see that the scores are normalized (which includes a log), whether you are using a single model or an ensemble. Moreover, when printing the scores, they are converted from base e to base 2:
```python
print('P-{}\t{}'.format(
    sample_id,
    ' '.join(map(
        lambda x: '{:.4f}'.format(x),
        # convert from base e to base 2
        hypo['positional_scores'].div_(math.log(2)).tolist(),
    ))
))
```
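To see what these base-2 scores mean as plain probabilities, you can exponentiate them yourself. This is just a sketch using the P values from the example above; the token alignment (one score per output token, plus the end-of-sentence marker) is my assumption from counting the values:

```python
# P values printed by fairseq-generate (base-2 log probabilities)
p_scores = [-0.0763, -0.1849, -0.0956, -0.0946, -0.0735,
            -0.1150, -0.1301, -0.0042, -0.0321, -0.0171,
            -0.0052, -0.0062, -0.0015]
# Assumed token alignment for the example hypothesis, plus end-of-sentence
tokens = ['Pourquoi', 'est-il', 'rare', 'de', 'découvrir', 'de', 'nouvelles',
          'espèces', 'de', 'mammifères', 'marins', '?', '</s>']

for tok, s in zip(tokens, p_scores):
    # probability = 2 ** (base-2 log probability)
    print('{:12s} {:.4f}'.format(tok, 2 ** s))
```

For instance, "Pourquoi" at -0.0763 corresponds to a probability of roughly 0.95, while the -0.0015 on the end-of-sentence marker is essentially probability 1.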
Q: What I'm trying to work out is if I can use either the H number, or somehow to use the individual P numbers, to get a confidence measure in its translation.
It turns out that the H value is simply the average of the P values, as you can see here:

```python
score_i = avg_probs_i.sum() / tgt_len
```

also converted to base 2. You can check that with your example:
```python
import numpy as np

print(np.mean([-0.0763, -0.1849, -0.0956, -0.0946, -0.0735,
               -0.1150, -0.1301, -0.0042, -0.0321, -0.0171,
               -0.0052, -0.0062, -0.0015]))
# >>> -0.06433076923076922
```
Another measurement that is often used to assess the performance of a language model is perplexity. A good thing is that perplexity can be computed easily from the P values, as shown in the language model example of the fairseq repository:
```python
# Compute perplexity for a sequence
en_lm.score('Barack Obama is coming to Sydney and New Zealand')['positional_scores'].mean().neg().exp()
# tensor(15.1474)
```
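Note that in the language-model example the positional scores are natural-log probabilities, hence the .exp(); the P values printed by fairseq-generate are already in base 2, so for those you would exponentiate with 2 instead. A sketch of the same computation on the example translation's scores:

```python
import numpy as np

# Base-2 P values from the example translation
p = [-0.0763, -0.1849, -0.0956, -0.0946, -0.0735, -0.1150, -0.1301,
     -0.0042, -0.0321, -0.0171, -0.0052, -0.0062, -0.0015]
# Perplexity = 2 ** (negative mean log2 probability)
ppl = 2 ** -np.mean(p)
print(ppl)  # close to 1, i.e. a very confident hypothesis
```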
I'm not an expert on NLP, so I can't really tell you which one you should use in your case.
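As a practical starting point, you could pull both numbers out of the fairseq-generate output and compare them against your subjective judgments. This is only a sketch: the tab-separated line layout is inferred from the sample output above, and using the minimum P value as a red flag is just a heuristic, not something fairseq prescribes:

```python
# Parse H- and P- lines from fairseq-generate output, keeping the average
# score (H) and the worst per-token score (min of P) for each sentence id.
def parse_generate_output(lines):
    results = {}
    for line in lines:
        if line.startswith('H-'):
            tag, score, text = line.split('\t')
            entry = results.setdefault(tag[2:], {})
            entry['h'] = float(score)
            entry['text'] = text
        elif line.startswith('P-'):
            tag, scores = line.split('\t')
            results.setdefault(tag[2:], {})['p'] = [float(x) for x in scores.split()]
    return results

sample = [
    'H-0\t-0.0643349438905716\tPourquoi est-il rare de découvrir de nouvelles espèces de mammifères marins?',
    'P-0\t-0.0763 -0.1849 -0.0956 -0.0946 -0.0735 -0.1150 -0.1301 -0.0042 -0.0321 -0.0171 -0.0052 -0.0062 -0.0015',
]
parsed = parse_generate_output(sample)
print(parsed['0']['h'])       # average score (H)
print(min(parsed['0']['p']))  # worst per-token score
```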