
How to interpret the P numbers that fairseq-generate produces?

When using fairseq-generate.py with the transformer architecture, each translation produces a section like this:

Why is it rare to discover new marine mammal species?
S-0     Why is it rare to discover new marine mam@@ mal species ?
H-0     -0.0643349438905716     Pourquoi est-il rare de découvrir de nouvelles espèces de mammifères marins?
P-0     -0.0763 -0.1849 -0.0956 -0.0946 -0.0735 -0.1150 -0.1301 -0.0042 -0.0321 -0.0171 -0.0052 -0.0062 -0.0015

With this explanation:

H is the hypothesis along with an average log-likelihood; and P is the positional score per token position, including the end-of-sentence marker

I'm wondering if it is reasonable to say that a low (absolute) number in the P row means higher confidence in that particular word? E.g., does -0.07 for "Pourquoi" mean it was happier about that than it was (-0.1849) about "est-il"? And does the low -0.0015 at the end mean it was really confident the sentence should end there?

Background: What I'm trying to work out is whether I can use either the H number, or somehow the individual P numbers, to get a confidence measure for its translation. I've been analyzing a handful of translations against the H number and didn't notice much correspondence between it and my subjective opinion of translation quality. But I have a couple where I thought the translation was particularly poor (it had missed a bit of key information), and the final P number was a relatively high -0.6099 and -0.3091. (The final P number is around -0.11 on most of them.)

Q: I'm wondering if it is reasonable to say a low (absolute) number in the P row means higher confidence in that particular word?

  • Yes. As the docs say, "P is the positional score per token position". The score is actually the log probability, so the higher it is (i.e., the lower its absolute value), the more "confident" the model is. The source code may not be that easy to follow, but the scores are generated by the SequenceScorer, and there you can see that the scores are normalized (which includes taking a log, whether you're using a single model or an ensemble). Moreover, when printing the scores, they are converted from base e to base 2:

     print('P-{}\t{}'.format(
         sample_id,
         ' '.join(map(
             lambda x: '{:.4f}'.format(x),
             # convert from base e to base 2
             hypo['positional_scores'].div_(math.log(2)).tolist(),
         ))
     ))
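
    To make that reading concrete, here is a quick sanity check in plain Python (not fairseq code; the variable names are just illustrative). Since the printed P values are base-2 log probabilities, 2 ** P recovers the probability the model assigned to each token:

     # The printed P values are base-2 log probabilities,
     # so 2 ** P recovers the token probability.
     p_pourquoi = -0.0763   # first P value from the example above
     p_eos = -0.0015        # last P value (the end-of-sentence marker)

     print(2 ** p_pourquoi)  # ~0.948: "Pourquoi" got ~95% probability
     print(2 ** p_eos)       # ~0.999: almost certain the sentence ends here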

Q: What I'm trying to work out is if I can use either the H number, or somehow to use the individual P numbers, to get a confidence measure in its translation.

  • It turns out that the H value is simply the average of the P values, as you can see here:

     score_i = avg_probs_i.sum() / tgt_len

    also converted to base 2. You can check that in your example:

     import numpy as np

     print(np.mean([-0.0763, -0.1849, -0.0956, -0.0946, -0.0735, -0.1150, -0.1301,
                    -0.0042, -0.0321, -0.0171, -0.0052, -0.0062, -0.0015]))
     # >>> -0.06433076923076922

    Another measure that is often used to assess the performance of a language model is perplexity. Conveniently, perplexity can easily be computed from the P values, as shown in the Language Model example of the fairseq repository:

     # Compute perplexity for a sequence
     en_lm.score('Barack Obama is coming to Sydney and New Zealand')['positional_scores'].mean().neg().exp()
     # tensor(15.1474)
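
    The same idea applies to your translation example. Here is a minimal sketch in plain Python; note that the P values printed by fairseq-generate are already in base 2 (see the print code above), so we exponentiate with 2 rather than e:

     import numpy as np

     # P values from the example hypothesis (base-2 log probabilities)
     p = [-0.0763, -0.1849, -0.0956, -0.0946, -0.0735, -0.1150, -0.1301,
          -0.0042, -0.0321, -0.0171, -0.0052, -0.0062, -0.0015]

     # Perplexity = 2 ** (negative mean base-2 log probability),
     # which also equals 2 ** -H, since H is the mean of the P values.
     print(2 ** -np.mean(p))  # ~1.046: close to 1, i.e. the model was very confident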

    I'm not an expert on NLP, so I can't really tell you which one you should use in your case.
