
neither ssplit.htmlBoundariesToDiscard nor clean.xmlTags working in German

I'm using the German components of CoreNLP and trying to clean up HTML tags in text. I tried both the cleanxml annotator ( clean.xmltags = .* ) and the relevant ssplit option ( ssplit.htmlBoundariesToDiscard = p,text,img,a ), and neither has worked.

Are these options case sensitive? (I tried both p and P together, which didn't seem to make a difference.) Does this have anything to do with the text being in German?
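For reference, here is roughly what the two configurations described above look like in a properties file (a sketch; the property names are taken from the question itself, and the tag list is illustrative):

```
# Option 1: strip all XML/HTML tags with the cleanxml annotator
annotators = tokenize, cleanxml, ssplit
clean.xmltags = .*

# Option 2: discard certain tags as sentence boundaries during ssplit
annotators = tokenize, ssplit
ssplit.htmlBoundariesToDiscard = p,text,img,a
```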

Thanks!

Okay, here is a completely changed answer!

I was running Stanford CoreNLP 3.5.2 and ran this command in the distribution folder:

java -Xmx6g -cp "*:." edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,cleanxml,ssplit,pos,lemma,ner,parse -file sample_german.html -outputFormat text

I put a German sentence in between some basic HTML tags and the output had the tags removed. Does this command work for your documents?
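If it helps to compare, a minimal test file along those lines could look like this (hypothetical content; any simple German sentence between basic tags will do):

```html
<html>
  <body>
    <p>Das ist ein einfacher deutscher Satz.</p>
  </body>
</html>
```

Running the command above on such a file should produce tokenized, annotated output with the tags removed.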

I am not sure how you are calling the pipeline, but the above should work from the command line. If you are using a .prop file, make sure cleanxml comes immediately after tokenize in the annotators list, as in this example:

annotators = tokenize,cleanxml,ssplit,pos,lemma,ner,parse
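A fuller German properties file might look like the sketch below. The model paths are assumptions based on the properties file bundled with the German models jar of that era (StanfordCoreNLP-german.properties); verify them against the jar you actually have:

```
annotators = tokenize, cleanxml, ssplit, pos, lemma, ner, parse
clean.xmltags = .*
tokenize.language = de
# Model paths below are illustrative; check your German models jar for the exact names
pos.model = edu/stanford/nlp/models/pos-tagger/german/german-hgc.tagger
ner.model = edu/stanford/nlp/models/ner/german.hgc_175m_600.crf.ser.gz
parse.model = edu/stanford/nlp/models/lexparser/germanFactored.ser.gz
```

You would then pass this file to the pipeline with -props yourfile.prop on the command line.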

Please let me know if it's still not working!

