Using fairseq-generate.py with the transformer architecture, each translation produces a section like this:
Why is it rare to discover new marine mammal species?
S-0 Why is it rare to discover new marine mam@@ mal species ?
H-0 -0.0643349438905716 Pourquoi est-il rare de découvrir de nouvelles espèces de mammifères marins?
P-0 -0.0763 -0.1849 -0.0956 -0.0946 -0.0735 -0.1150 -0.1301 -0.0042 -0.0321 -0.0171 -0.0052 -0.0062 -0.0015
With this explanation:
H is the hypothesis along with an average log-likelihood, and P is the positional score per token position, including the end-of-sentence marker.
I'm wondering if it is reasonable to say that a low (absolute) number in the P row means higher confidence in that particular word. E.g. does -0.07 for "Pourquoi" mean it was happier about that than it was (-0.1849) about "est-il"? And does the low -0.0015 at the end mean it was really confident the sentence should end there?
Background: What I'm trying to work out is whether I can use either the H number, or somehow the individual P numbers, to get a confidence measure for the translation. I've been analyzing a handful of translations against the H number and didn't notice much correspondence between it and my subjective opinion of translation quality. But I have a couple where I thought the translation was particularly poor (it had missed a bit of key information), and the final P number was a relatively high -0.6099 and -0.3091. (The final P number is -0.11 or so on most of them.)
Q: I'm wondering if it is reasonable to say a low (absolute) number in the P row means higher confidence in that particular word?
Yes. As the docs say, "P is the positional score per token position". The score is actually the log probability, so the higher it is (i.e. the lower its absolute value), the more "confident" the model is. The source code may not be that easy to follow, but the scores are generated by the SequenceScorer, and there you can see that the scores are normalized (which includes a log), whether you are using a single model or an ensemble. Moreover, when printing the scores, they are converted from base e to base 2:
```python
print('P-{}\t{}'.format(
    sample_id,
    ' '.join(map(
        lambda x: '{:.4f}'.format(x),
        # convert from base e to base 2
        hypo['positional_scores'].div_(math.log(2)).tolist(),
    ))
))
```
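To see what these base-2 scores mean as plain probabilities, you can exponentiate them yourself. This is just a sketch using the P values from the example above; the token alignment (one score per output token, plus the end-of-sentence marker) is my assumption from counting the values:

```python
# P values printed by fairseq-generate (base-2 log probabilities)
p_scores = [-0.0763, -0.1849, -0.0956, -0.0946, -0.0735,
            -0.1150, -0.1301, -0.0042, -0.0321, -0.0171,
            -0.0052, -0.0062, -0.0015]
# Assumed token alignment for the example hypothesis, plus end-of-sentence
tokens = ['Pourquoi', 'est-il', 'rare', 'de', 'découvrir', 'de', 'nouvelles',
          'espèces', 'de', 'mammifères', 'marins', '?', '</s>']

for tok, s in zip(tokens, p_scores):
    # probability = 2 ** (base-2 log probability)
    print('{:12s} {:.4f}'.format(tok, 2 ** s))
```

For instance, "Pourquoi" at -0.0763 corresponds to a probability of roughly 0.95, while the -0.0015 on the end-of-sentence marker is essentially probability 1.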
Q: What I'm trying to work out is if I can use either the H number, or somehow to use the individual P numbers, to get a confidence measure in its translation.
It turns out that the H value is simply the average of the P values, as you can see here:

```python
score_i = avg_probs_i.sum() / tgt_len
```

also converted to base 2. You can check that with your example:
```python
import numpy as np

print(np.mean([-0.0763, -0.1849, -0.0956, -0.0946, -0.0735,
               -0.1150, -0.1301, -0.0042, -0.0321, -0.0171,
               -0.0052, -0.0062, -0.0015]))
# >>> -0.06433076923076922
```
Another measurement that is often used to assess the performance of a language model is perplexity. A good thing is that perplexity can be computed easily from the P values, as shown in the language model example of the fairseq repository:
```python
# Compute perplexity for a sequence
en_lm.score('Barack Obama is coming to Sydney and New Zealand')['positional_scores'].mean().neg().exp()
# tensor(15.1474)
```
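Note that in the language-model example the positional scores are natural-log probabilities, hence the .exp(); the P values printed by fairseq-generate are already in base 2, so for those you would exponentiate with 2 instead. A sketch of the same computation on the example translation's scores:

```python
import numpy as np

# Base-2 P values from the example translation
p = [-0.0763, -0.1849, -0.0956, -0.0946, -0.0735, -0.1150, -0.1301,
     -0.0042, -0.0321, -0.0171, -0.0052, -0.0062, -0.0015]
# Perplexity = 2 ** (negative mean log2 probability)
ppl = 2 ** -np.mean(p)
print(ppl)  # close to 1, i.e. a very confident hypothesis
```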
I'm not an expert on NLP, so I can't really tell you which one you should use in your case.
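As a practical starting point, you could pull both numbers out of the fairseq-generate output and compare them against your subjective judgments. This is only a sketch: the tab-separated line layout is inferred from the sample output above, and using the minimum P value as a red flag is just a heuristic, not something fairseq prescribes:

```python
# Parse H- and P- lines from fairseq-generate output, keeping the average
# score (H) and the worst per-token score (min of P) for each sentence id.
def parse_generate_output(lines):
    results = {}
    for line in lines:
        if line.startswith('H-'):
            tag, score, text = line.split('\t')
            entry = results.setdefault(tag[2:], {})
            entry['h'] = float(score)
            entry['text'] = text
        elif line.startswith('P-'):
            tag, scores = line.split('\t')
            results.setdefault(tag[2:], {})['p'] = [float(x) for x in scores.split()]
    return results

sample = [
    'H-0\t-0.0643349438905716\tPourquoi est-il rare de découvrir de nouvelles espèces de mammifères marins?',
    'P-0\t-0.0763 -0.1849 -0.0956 -0.0946 -0.0735 -0.1150 -0.1301 -0.0042 -0.0321 -0.0171 -0.0052 -0.0062 -0.0015',
]
parsed = parse_generate_output(sample)
print(parsed['0']['h'])       # average score (H)
print(min(parsed['0']['p']))  # worst per-token score
```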