I'm using the German components of CoreNLP and trying to clean up HTML tags from text. I tried both the cleanxml annotator (clean.xmltags = .*) and the relevant option of ssplit (ssplit.htmlBoundariesToDiscard = p,text,img,a), and neither has worked.
Are these case sensitive (I tried both p and P together, and it didn't seem to make a difference)? Does this have anything to do with the fact that the text is in German?
Thanks!
Okay, here is a completely changed answer!
I was running Stanford CoreNLP 3.5.2 and ran this command in the distribution folder:
java -Xmx6g -cp "*:." edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,cleanxml,ssplit,pos,lemma,ner,parse -file sample_german.html -outputFormat text
I put a German sentence in between some basic HTML tags and the output had the tags removed. Does this command work for your documents?
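For reference, the test file was just a German sentence wrapped in basic HTML tags; a minimal sketch of such a file might look like this (the particular sentence and markup are illustrative, not from the original test):

```html
<!-- sample_german.html: a German sentence inside basic HTML tags -->
<html>
  <body>
    <p>Das ist ein einfacher deutscher Satz.</p>
  </body>
</html>
```

With the command above, the tags should be stripped and only the sentence text should appear in the annotated output.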
I am not sure how you are calling the pipeline, but the above should work from the command line. If you are using a .prop file, make sure you have cleanxml as the annotator immediately after tokenize, as in this example:
annotators = tokenize,cleanxml,ssplit,pos,lemma,ner,parse
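Putting that together, a minimal properties file might look like the sketch below (the file name is a placeholder; clean.xmltags = .* is the setting from the question, which tells cleanxml to strip every tag):

```
# german.prop -- minimal pipeline configuration (sketch)
# cleanxml must come after tokenize in the annotator list
annotators = tokenize,cleanxml,ssplit,pos,lemma,ner,parse

# remove all XML/HTML tags from the token stream
clean.xmltags = .*
```

You would then run it with something like `java -Xmx6g -cp "*:." edu.stanford.nlp.pipeline.StanfordCoreNLP -props german.prop -file sample_german.html -outputFormat text`.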
Please let me know if it's still not working!