Limiting the number of iterations in Stanford NER

Question

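I am training a Stanford NER CRF model on a custom dataset, but training has now reached iteration 333, i.e. the process has been running for several hours. This is what is printed in the terminal: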

Iter 335 evals 400 <D> [M 1.000E0] 2.880E3 38054.87s |5.680E1| {6.652E-6} 4.488E-4 - 
Iter 336 evals 401 <D> [M 1.000E0] 2.880E3 38153.66s |1.243E2| {1.456E-5} 4.415E-4 -
 -

The properties file I am using is shown below. Is there some way I can limit the number of iterations to, say, 20?

#location of the training file
trainFile = TRAIN5000.tsv
#location where you would like to save (serialize to) your
#classifier; adding .gz at the end automatically gzips the file,
#making it faster and smaller
serializeTo = ner-model_TRAIN5000.ser.gz

#structure of your training file; this tells the classifier
#that the word is in column 0 and the correct answer is in
#column 1
map = word=0,answer=1

#these are the features we'd like to train with
#some are discussed below, the rest can be
#understood by looking at NERFeatureFactory
useClassFeature=true
useWord=true
useNGrams=true
#no ngrams will be included that do not contain either the
#beginning or end of the word
noMidNGrams=true
useDisjunctive=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
#the next 4 deal with word shape features
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
saveFeatureIndexToDisk = true
printFeatures=true
useObservedSequencesOnly=true
featureDiffThresh=0.05

Answer 1

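I was training biomedical (BioNER) models with the Stanford CoreNLP CRF classifier on IOB-tagged, tokenized text, as described in https://nlp.stanford.edu/software/crf-faq.html.

My corpus, assembled from downloaded sources, was very large (~1.5M lines; 6 features: GENE; ...). Since training appeared to run indefinitely, I plotted the ratio of the values to get an idea of the progress.

Grepping the Java source, I found that the default value of TOL (tolerance; used to decide when to terminate the training session) is 1E-6 (0.000001), in .../CoreNLP/src/edu/stanford/nlp/optimization/QNMinimizer.java.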

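Looking at that plot, my original training session would never have finished. [The plot also shows that a larger TOL value, e.g. tolerance=0.05, would end training prematurely, because that threshold is hit by the "noise" that occurs near the start of the training session. I confirmed this with a tolerance=0.05 entry in my .prop file; TOL values of 0.01, 0.005, etc. were "OK", however.]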
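The progress check described above can be approximated without a plot by parsing the optimizer's log lines. The following is a minimal Python sketch, not part of the original answer. It assumes the line layout shown in the question and assumes that the last numeric column is the quantity compared against the tolerance; the function names `parse_line` and `first_iter_below` are hypothetical.

```python
import re

# Matches one optimizer progress line in the layout shown in the question, e.g.
# Iter 335 evals 400 <D> [M 1.000E0] 2.880E3 38054.87s |5.680E1| {6.652E-6} 4.488E-4 -
LINE_RE = re.compile(
    r"Iter (?P<iter>\d+) evals (?P<evals>\d+) .*?\] "
    r"(?P<value>\S+) (?P<secs>[\d.]+)s \|(?P<gnorm>\S+?)\| "
    r"\{(?P<rel>\S+?)\} (?P<last>\S+)"
)


def parse_line(line):
    """Return iteration number, elapsed seconds and last column, or None."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    try:
        last = float(m.group("last"))
    except ValueError:
        # The last column is not always numeric; skip such lines.
        return None
    return {
        "iter": int(m.group("iter")),
        "seconds": float(m.group("secs")),
        "last": last,
    }


def first_iter_below(lines, tol):
    """First iteration whose last column drops below tol (assumed stop rule)."""
    for line in lines:
        rec = parse_line(line)
        if rec is not None and rec["last"] < tol:
            return rec["iter"]
    return None
```

Feeding a saved training log through `first_iter_below` with a few candidate values (0.05, 0.01, 0.005) gives a rough idea of where each one would stop training, under the assumptions stated above.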

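Adding "maxIterations=20" to the properties file, as described by @StanfordNLPHelp (elsewhere in this thread), appeared to be ignored unless I also added and changed the tolerance= value in my bioner.prop properties file; e.g.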

tolerance=0.005
# optional
maxIterations=20

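With these settings the classifier trained the model (bioner.ser.gz) quickly. [When I added the maxIterations line to my .prop file without the tolerance line, training continued to run "forever", as before.]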
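To apply these two settings to an existing properties file without hand-editing it, something like the following Python sketch could be used. It is an illustration only: `with_overrides` and `train_command` are hypothetical helpers, the jar name `stanford-ner.jar` is an assumption about the local install, and the command follows the `-prop` invocation from the CRF FAQ linked above. Each setting is written on its own line, since a Java properties file treats trailing text after the value as part of the value.

```python
def with_overrides(prop_text, overrides):
    """Return prop_text with every key in overrides set to its new value."""
    remaining = dict(overrides)
    out = []
    for line in prop_text.splitlines():
        stripped = line.strip()
        # Only rewrite real key=value lines, never comments or blank lines.
        if stripped and not stripped.startswith("#") and "=" in stripped:
            key = stripped.split("=", 1)[0].strip()
            if key in remaining:
                out.append(f"{key}={remaining.pop(key)}")
                continue
        out.append(line)
    # Keys that were not already present are appended at the end.
    out.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(out) + "\n"


def train_command(prop_path, jar="stanford-ner.jar"):
    """Build (but do not run) the training command; the jar path is assumed."""
    return ["java", "-cp", jar,
            "edu.stanford.nlp.ie.crf.CRFClassifier", "-prop", prop_path]
```

The returned list can be passed to `subprocess.run` once the updated text has been written back to the .prop file.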

A list of parameters that can be included in the .prop file can be found here:

https://nlp.stanford.edu/nlp/javadoc/javanlp-3.5.0/edu/stanford/nlp/ie/NERFeatureFactory.html

Answer 2

Short answer: use tolerance (the default is 1e-4). There is another parameter, maxIterations, but it is ignored.

Answer 3

Use maxQNItr=21 in the prop file. Training will then run for at most 20 iterations. I got help from David's answer.

Answer 4

Add maxIterations=20 to the properties file.

4 answers

Answer 1: score 1, 2017-11-06 19:01:01
Answer 2: score 0, 2018-09-24 13:57:08
Answer 3: score 0, 2019-03-14 18:10:13
Answer 4: score -1, 2017-04-09 07:20:53
