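How to train a new parser model for Stanford NLP from a treebank?
I have downloaded the UPDT Persian treebank (Uppsala Persian Dependency Treebank) and am trying to build a dependency parser model from it using Stanford NLP. I tried training the model both from the command line and from Java code, but I get an exception in both cases.
1- Training the model from the command line: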
java -mx1500m edu.stanford.nlp.parser.lexparser.LexicalizedParser -train UPDT\train.conll 0 -saveToSerializedFile UPDT\updt.model.ser.gz
When I run the command above, I get this exception:
done [read 26 trees]. Time elapsed: 0 ms
Options parameters:
useUnknownWordSignatures 0
smoothInUnknownsThreshold 100
smartMutation false
useUnicodeType false
unknownSuffixSize 1
unknownPrefixSize 1
flexiTag false
useSignatureForKnownSmoothing false
wordClassesFile null
parserParams edu.stanford.nlp.parser.lexparser.EnglishTreebankParserParams
forceCNF false
doPCFG true
doDep true
freeDependencies false
directional true
genStop true
distance true
coarseDistance false
dcTags true
nPrune false
Train parameters:
smooth=false
PA=true
GPA=false
selSplit=false (0.0)
mUnary=0
mUnaryTags=false
sPPT=false
tagPA=false
tagSelSplit=false (0.0)
rightRec=false
leftRec=false
collinsPunc=false
markov=false
mOrd=1
hSelSplit=false (10)
compactGrammar=0
postPA=false
postGPA=false
selPSplit=false (0.0)
tagSelPSplit=false (0.0)
postSplitWithBase=false
fractionBeforeUnseenCounting=0.5
openClassTypesThreshold=50
preTransformer=null
taggedFiles=null
predictSplits=false
splitCount=1
splitRecombineRate=0.0
simpleBinarizedLabels=false
noRebinarization=false
trainingThreads=1
dvKBest=100
trainingIterations=40
batchSize=25
regCost=1.0E-4
qnIterationsPerBatch=1
qnEstimates=15
qnTolerance=15.0
debugOutputFrequency=0
randomSeed=0
learningRate=0.1
deltaMargin=0.1
unknownNumberVector=true
unknownDashedWordVectors=true
unknownCapsVector=true
unknownChineseYearVector=true
unknownChineseNumberVector=true
unknownChinesePercentVector=true
dvSimplifiedModel=false
scalingForInit=0.5
maxTrainTimeSeconds=0
unkWord=*UNK*
lowercaseWordVectors=false
transformMatrixType=DIAGONAL
useContextWords=false
trainWordVectors=true
stalledIterationLimit=12
markStrahler=false
Using EnglishTreebankParserParams splitIN=0 sPercent=false sNNP=0 sQuotes=false sSFP=false rbGPA=false j#=false jJJ=false jNounTags=false sPPJJ=false sTRJJ=false sJJCOMP=false sMoreLess=false unaryDT=false unaryRB=false unaryPRP=false reflPRP=false unaryIN=false sCC=0 sNT=false sRB=false sAux=0 vpSubCat=false mDTV=0 sVP=0 sVPNPAgr=false sSTag=0 mVP=false sNP%=0 sNPPRP=false dominatesV=0 dominatesI=false dominatesC=false mCC=0 sSGapped=0 numNP=false sPoss=0 baseNP=0 sNPNNP=0 sTMP=0 sNPADV=0 cTags=false rightPhrasal=false gpaRootVP=false splitSbar=0 mPPTOiIN=0 cWh=0
Binarizing trees...done. Time elapsed: 12 ms
Extracting PCFG...PennTreeReader: warning: file has extra non-matching right parenthesis [ignored]
Exception in thread "main" java.lang.IllegalArgumentException: No head rule defined for _ using class edu.stanford.nlp.trees.ModCollinsHeadFinder in (_
DELM
DELM
DELM
13
punct
_
_
15
??????
_
N
N_SING
SING
13
appos
_
_
16
???????
_
ADJ
ADJ
ADJ
15
amod
_
_
17
??
_
P
P
P
15
prep
_
_
18
???
_
N
N_SING
SING
17
pobj
_
_
19
?
_
CON
CON
CON
18
cc
_
_
20
????
_
N
N_SING
SING
18
conj
_
_
21
????
_
N
N_SING
SING
20
poss/pc
_
_
22)
at edu.stanford.nlp.trees.AbstractCollinsHeadFinder.determineNonTrivialHead(AbstractCollinsHeadFinder.java:242)
at edu.stanford.nlp.trees.AbstractCollinsHeadFinder.determineHead(AbstractCollinsHeadFinder.java:189)
at edu.stanford.nlp.trees.AbstractCollinsHeadFinder.determineHead(AbstractCollinsHeadFinder.java:140)
at edu.stanford.nlp.parser.lexparser.TreeAnnotator.transformTreeHelper(TreeAnnotator.java:145)
at edu.stanford.nlp.parser.lexparser.TreeAnnotator.transformTree(TreeAnnotator.java:51)
at edu.stanford.nlp.parser.lexparser.TreeAnnotatorAndBinarizer.transformTree(TreeAnnotatorAndBinarizer.java:104)
at edu.stanford.nlp.trees.CompositeTreeTransformer.transformTree(CompositeTreeTransformer.java:30)
at edu.stanford.nlp.trees.TransformingTreebank$TransformingTreebankIterator.next(TransformingTreebank.java:195)
at edu.stanford.nlp.trees.TransformingTreebank$TransformingTreebankIterator.next(TransformingTreebank.java:176)
at edu.stanford.nlp.trees.FilteringTreebank$FilteringTreebankIterator.primeNext(FilteringTreebank.java:100)
at edu.stanford.nlp.trees.FilteringTreebank$FilteringTreebankIterator.<init>(FilteringTreebank.java:85)
at edu.stanford.nlp.trees.FilteringTreebank.iterator(FilteringTreebank.java:72)
at edu.stanford.nlp.parser.lexparser.AbstractTreeExtractor.tallyTrees(AbstractTreeExtractor.java:64)
at edu.stanford.nlp.parser.lexparser.AbstractTreeExtractor.extract(AbstractTreeExtractor.java:89)
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.getParserFromTreebank(LexicalizedParser.java:881)
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.main(LexicalizedParser.java:1394)
2- Training the model from Java code:
import java.io.File;
import java.io.IOException;
import java.util.Collection;
import java.util.List;
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.parser.lexparser.Options;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.trees.GrammaticalStructure;
import edu.stanford.nlp.trees.GrammaticalStructureFactory;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.Treebank;
import edu.stanford.nlp.trees.TreebankLanguagePack;
public class FromTreeBank {

    public static void main(String[] args) throws IOException {
        // TODO Auto-generated method stub
        String treebankPathUPDT = "src/model/UPDT.1.2/train.conll";
        String persianFilePath = "src/txt/persianSentences.txt";
        File file = new File(treebankPathUPDT);
        Options op = new Options();
        Treebank tr = op.tlpParams.diskTreebank();
        tr.loadPath(file);
        LexicalizedParser lpc = LexicalizedParser.trainFromTreebank(tr, op);
        // Once the lpc is trained, use it to parse a file which contains Persian text
        // demoDP(lpc, persianFilePath);
    }

    public static void demoDP(LexicalizedParser lp, String filename) {
        // This option shows loading, sentence-segmenting and tokenizing
        // a file using DocumentPreprocessor.
        TreebankLanguagePack tlp = lp.treebankLanguagePack(); // a PennTreebankLanguagePack for English
        GrammaticalStructureFactory gsf = null;
        if (tlp.supportsGrammaticalStructures()) {
            gsf = tlp.grammaticalStructureFactory();
        }
        // You could also create a tokenizer here (as below) and pass it
        // to DocumentPreprocessor
        for (List<HasWord> sentence : new DocumentPreprocessor(filename)) {
            Tree parse = lp.apply(sentence);
            parse.pennPrint();
            System.out.println();
            if (gsf != null) {
                GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
                Collection tdl = gs.typedDependenciesCCprocessed();
                System.out.println(tdl);
                System.out.println();
            }
        }
    }
}
The Java program above throws the following exception as well:
Options parameters:
useUnknownWordSignatures 0
smoothInUnknownsThreshold 100
smartMutation false
useUnicodeType false
unknownSuffixSize 1
unknownPrefixSize 1
flexiTag false
useSignatureForKnownSmoothing false
wordClassesFile null
parserParams edu.stanford.nlp.parser.lexparser.EnglishTreebankParserParams
forceCNF false
doPCFG true
doDep true
freeDependencies false
directional true
genStop true
distance true
coarseDistance false
dcTags true
nPrune false
Train parameters:
smooth=false
PA=true
GPA=false
selSplit=false (0.0)
mUnary=0
mUnaryTags=false
sPPT=false
tagPA=false
tagSelSplit=false (0.0)
rightRec=false
leftRec=false
collinsPunc=false
markov=false
mOrd=1
hSelSplit=false (10)
compactGrammar=0
postPA=false
postGPA=false
selPSplit=false (0.0)
tagSelPSplit=false (0.0)
postSplitWithBase=false
fractionBeforeUnseenCounting=0.5
openClassTypesThreshold=50
preTransformer=null
taggedFiles=null
predictSplits=false
splitCount=1
splitRecombineRate=0.0
simpleBinarizedLabels=false
noRebinarization=false
trainingThreads=1
dvKBest=100
trainingIterations=40
batchSize=25
regCost=1.0E-4
qnIterationsPerBatch=1
qnEstimates=15
qnTolerance=15.0
debugOutputFrequency=0
randomSeed=0
learningRate=0.1
deltaMargin=0.1
unknownNumberVector=true
unknownDashedWordVectors=true
unknownCapsVector=true
unknownChineseYearVector=true
unknownChineseNumberVector=true
unknownChinesePercentVector=true
dvSimplifiedModel=false
scalingForInit=0.5
maxTrainTimeSeconds=0
unkWord=*UNK*
lowercaseWordVectors=false
transformMatrixType=DIAGONAL
useContextWords=false
trainWordVectors=true
stalledIterationLimit=12
markStrahler=false
Using EnglishTreebankParserParams splitIN=0 sPercent=false sNNP=0 sQuotes=false sSFP=false rbGPA=false j#=false jJJ=false jNounTags=false sPPJJ=false sTRJJ=false sJJCOMP=false sMoreLess=false unaryDT=false unaryRB=false unaryPRP=false reflPRP=false unaryIN=false sCC=0 sNT=false sRB=false sAux=0 vpSubCat=false mDTV=0 sVP=0 sVPNPAgr=false sSTag=0 mVP=false sNP%=0 sNPPRP=false dominatesV=0 dominatesI=false dominatesC=false mCC=0 sSGapped=0 numNP=false sPoss=0 baseNP=0 sNPNNP=0 sTMP=0 sNPADV=0 cTags=false rightPhrasal=false gpaRootVP=false splitSbar=0 mPPTOiIN=0 cWh=0
Binarizing trees...done. Time elapsed: 122 ms
Extracting PCFG...PennTreeReader: warning: file has extra non-matching right parenthesis [ignored]
java.lang.IllegalArgumentException: No head rule defined for _ using class edu.stanford.nlp.trees.ModCollinsHeadFinder in (_
DELM
DELM
DELM
13
punct
_
_
15
تلفیقی
_
N
N_SING
SING
13
appos
_
_
16
طنزآمیز
_
ADJ
ADJ
ADJ
15
amod
_
_
17
از
_
P
P
P
15
prep
_
_
18
اسم
_
N
N_SING
SING
17
pobj
_
_
19
و
_
CON
CON
CON
18
cc
_
_
20
شیوه
_
N
N_SING
SING
18
conj
_
_
21
کارش
_
N
N_SING
SING
20
poss/pc
_
_
22)
at edu.stanford.nlp.trees.AbstractCollinsHeadFinder.determineNonTrivialHead(AbstractCollinsHeadFinder.java:242)
at edu.stanford.nlp.trees.AbstractCollinsHeadFinder.determineHead(AbstractCollinsHeadFinder.java:189)
at edu.stanford.nlp.trees.AbstractCollinsHeadFinder.determineHead(AbstractCollinsHeadFinder.java:140)
at edu.stanford.nlp.parser.lexparser.TreeAnnotator.transformTreeHelper(TreeAnnotator.java:145)
at edu.stanford.nlp.parser.lexparser.TreeAnnotator.transformTree(TreeAnnotator.java:51)
at edu.stanford.nlp.parser.lexparser.TreeAnnotatorAndBinarizer.transformTree(TreeAnnotatorAndBinarizer.java:104)
at edu.stanford.nlp.trees.CompositeTreeTransformer.transformTree(CompositeTreeTransformer.java:30)
at edu.stanford.nlp.trees.TransformingTreebank$TransformingTreebankIterator.next(TransformingTreebank.java:195)
at edu.stanford.nlp.trees.TransformingTreebank$TransformingTreebankIterator.next(TransformingTreebank.java:176)
at edu.stanford.nlp.trees.FilteringTreebank$FilteringTreebankIterator.primeNext(FilteringTreebank.java:100)
at edu.stanford.nlp.trees.FilteringTreebank$FilteringTreebankIterator.<init>(FilteringTreebank.java:85)
at edu.stanford.nlp.trees.FilteringTreebank.iterator(FilteringTreebank.java:72)
at edu.stanford.nlp.parser.lexparser.AbstractTreeExtractor.tallyTrees(AbstractTreeExtractor.java:64)
at edu.stanford.nlp.parser.lexparser.AbstractTreeExtractor.extract(AbstractTreeExtractor.java:89)
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.getParserFromTreebank(LexicalizedParser.java:881)
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.trainFromTreebank(LexicalizedParser.java:267)
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.trainFromTreebank(LexicalizedParser.java:278)
at FromTreeBank.main(FromTreeBank.java:46)
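Actually, I am not sure whether the command line or the Java code is correct. I cannot figure out what is missing in either of them, and I would appreciate it if someone could tell me why these exceptions occur and what is wrong, or suggest a better way to train a model from a treebank.
Thanks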
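If you are still wondering why you get this error, it is exactly what the error message says: no head rule is defined in the class edu.stanford.nlp.trees.ModCollinsHeadFinder for the character "_" (an underscore, I believe).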
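I had the same problem with parenthesis characters, and after removing the data that contained parentheses I can now train the Stanford parser without errors. I have not tried to find a direct fix by changing the code. The simplest approach is to do what I did and delete the data containing that character.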
If you have already solved the problem, could you share your solution? I also need to learn more about the Stanford parser.
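For readers who want to try the "delete the offending data" workaround, here is a minimal sketch in plain Java. It assumes the CoNLL file separates sentences with blank lines, and it drops every sentence in which any line contains one of the given characters. The class name SentenceFilter and its methods are hypothetical helpers, not part of Stanford NLP.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Stanford NLP): removes whole sentences
// from a CoNLL-style file when any of their lines contains a given string.
final class SentenceFilter {

    // Assumption: sentences are blocks of non-empty lines separated by blank lines.
    static List<String> dropSentencesWith(List<String> lines, String... needles) {
        List<String> kept = new ArrayList<>();
        List<String> sentence = new ArrayList<>();
        boolean tainted = false;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                flush(sentence, tainted, kept);
                sentence.clear();
                tainted = false;
            } else {
                sentence.add(line);
                tainted = tainted || containsAny(line, needles);
            }
        }
        flush(sentence, tainted, kept); // last sentence may lack a trailing blank line
        return kept;
    }

    private static boolean containsAny(String line, String... needles) {
        for (String needle : needles) {
            if (line.contains(needle)) {
                return true;
            }
        }
        return false;
    }

    // Copies a clean, non-empty sentence to the output, followed by one blank line.
    private static void flush(List<String> sentence, boolean tainted, List<String> kept) {
        if (!sentence.isEmpty() && !tainted) {
            kept.addAll(sentence);
            kept.add("");
        }
    }

    // Usage: java SentenceFilter <input.conll> <output.conll>
    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java SentenceFilter <input.conll> <output.conll>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        Files.write(Paths.get(args[1]), dropSentencesWith(lines, "(", ")"), StandardCharsets.UTF_8);
    }
}
```

Filtering shrinks the training set, so it is worth checking how many sentences are lost before relying on the result.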
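The biggest problem here is that you are trying to train a constituency parser (also known as a phrase-structure parser) on a dependency treebank, which will not work.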
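To make the mismatch concrete, here is a rough heuristic sketch in plain Java that guesses whether a file looks like a CoNLL dependency treebank or a bracketed constituency treebank. It assumes CoNLL-X style token lines (10 tab-separated fields, the first being an integer token index) and Penn-style trees that start with "(". The class TreebankFormatGuess is a hypothetical illustration, not a Stanford NLP API.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical heuristic (not part of Stanford NLP): inspects the first
// non-blank line to guess what kind of treebank a file contains.
final class TreebankFormatGuess {

    enum Format { CONLL_DEPENDENCY, BRACKETED_CONSTITUENCY, UNKNOWN }

    static Format guess(List<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip leading blank lines
            }
            if (trimmed.startsWith("(")) {
                // Penn-style trees look like (ROOT (S (NP ...) (VP ...)))
                return Format.BRACKETED_CONSTITUENCY;
            }
            String[] fields = trimmed.split("\t");
            // Assumption: CoNLL-X token lines have 10 columns and start with the token index.
            if (fields.length == 10 && fields[0].matches("\\d+")) {
                return Format.CONLL_DEPENDENCY;
            }
            return Format.UNKNOWN;
        }
        return Format.UNKNOWN;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("17\tاز\t_\tP\tP\tP\t15\tprep\t_\t_");
        System.out.println(guess(sample));
    }
}
```

A file that comes back as CONLL_DEPENDENCY is not suitable input for LexicalizedParser, which reads bracketed phrase-structure trees.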
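CoreNLP also ships with a neural-network-based dependency parser, which you can train on the UPDT data. See the parser's project page for instructions on how to train a model.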
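As a rough sketch of what such a training run might look like, the snippet below assembles a command line for the neural dependency parser's main class, edu.stanford.nlp.parser.nndep.DependencyParser, using its -trainFile, -devFile and -model options. The file names (in particular dev.conll and the model name) are assumptions for illustration, the CoreNLP jars must be on the classpath when the command is run, and the project page remains the authoritative source for the available options. The helper class NndepTrainCommand is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds (but does not run) a training command for
// CoreNLP's neural dependency parser. Paths are placeholders.
final class NndepTrainCommand {

    static List<String> build(String trainFile, String devFile, String modelFile) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("edu.stanford.nlp.parser.nndep.DependencyParser");
        cmd.add("-trainFile");
        cmd.add(trainFile);   // CoNLL training data
        cmd.add("-devFile");
        cmd.add(devFile);     // CoNLL development data (assumed to exist)
        cmd.add("-model");
        cmd.add(modelFile);   // where the trained model is written
        return cmd;
    }

    public static void main(String[] args) {
        // Assumed file names; only train.conll appears in the question.
        List<String> cmd = build("UPDT/train.conll", "UPDT/dev.conll", "UPDT/updt.nndep.model.txt.gz");
        System.out.println(String.join(" ", cmd));
    }
}
```

The resulting list can be printed and pasted into a shell, or handed to `new ProcessBuilder(cmd).inheritIO().start()` to launch the training from Java.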
You can simply replace every "(" with "-LRB-" and every ")" with "-RRB-" in "trainFile.conll" (or whatever format you use), then rerun the parser. This worked for me.
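The replacement can be done in any text editor, but a small program makes it repeatable. Below is a minimal sketch in plain Java that rewrites a file line by line, substituting the Penn Treebank escape tokens -LRB- and -RRB- for literal parentheses. The class BracketEscaper is a hypothetical helper, not part of Stanford NLP, and the input is assumed to be UTF-8.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Stanford NLP): escapes literal
// parentheses so they are not mistaken for tree brackets.
final class BracketEscaper {

    // Replaces "(" with "-LRB-" and ")" with "-RRB-" in a single line.
    static String escapeLine(String line) {
        return line.replace("(", "-LRB-").replace(")", "-RRB-");
    }

    static List<String> escapeAll(List<String> lines) {
        List<String> escaped = new ArrayList<>(lines.size());
        for (String line : lines) {
            escaped.add(escapeLine(line));
        }
        return escaped;
    }

    // Usage: java BracketEscaper <input file> <output file>
    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java BracketEscaper <input file> <output file>");
            return;
        }
        // Assumption: the treebank file is UTF-8 encoded.
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        Files.write(Paths.get(args[1]), escapeAll(lines), StandardCharsets.UTF_8);
    }
}
```

Writing to a separate output file keeps the original treebank intact, which makes it easy to compare the two versions.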