
neither ssplit.htmlBoundariesToDiscard nor clean.xmlTags working in German

I'm using the German components of CoreNLP and trying to clean up HTML tags in text. I tried both the cleanxml annotator ( clean.xmltags = .* ) and the relevant ssplit option ( ssplit.htmlBoundariesToDiscard = p,text,img,a ), and neither has worked.

Are these options case sensitive? (I tried both p and P together, which didn't seem to make a difference.) Does this have anything to do with the text being in German?
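For reference, here is roughly what the two configurations described above look like in a properties file (a sketch; the property names are taken from the question itself, and the tag list is illustrative):

```
# Option 1: strip all XML/HTML tags with the cleanxml annotator
annotators = tokenize, cleanxml, ssplit
clean.xmltags = .*

# Option 2: discard certain tags as sentence boundaries during ssplit
annotators = tokenize, ssplit
ssplit.htmlBoundariesToDiscard = p,text,img,a
```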

Thanks!

Okay, here is a completely changed answer!

I was running Stanford CoreNLP 3.5.2 and ran this command in the distribution folder:

java -Xmx6g -cp "*:." edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,cleanxml,ssplit,pos,lemma,ner,parse -file sample_german.html -outputFormat text

I put a German sentence in between some basic HTML tags and the output had the tags removed. Does this command work for your documents?
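If it helps to compare, a minimal test file along those lines could look like this (hypothetical content; any simple German sentence between basic tags will do):

```html
<html>
  <body>
    <p>Das ist ein einfacher deutscher Satz.</p>
  </body>
</html>
```

Running the command above on such a file should produce tokenized, annotated output with the tags removed.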

I am not sure how you are calling the pipeline, but the above should work from the command line. If you are using a .prop file, make sure cleanxml comes immediately after tokenize in the annotators list, as in this example:

annotators = tokenize,cleanxml,ssplit,pos,lemma,ner,parse
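A fuller German properties file might look like the sketch below. The model paths are assumptions based on the properties file bundled with the German models jar of that era (StanfordCoreNLP-german.properties); verify them against the jar you actually have:

```
annotators = tokenize, cleanxml, ssplit, pos, lemma, ner, parse
clean.xmltags = .*
tokenize.language = de
# Model paths below are illustrative; check your German models jar for the exact names
pos.model = edu/stanford/nlp/models/pos-tagger/german/german-hgc.tagger
ner.model = edu/stanford/nlp/models/ner/german.hgc_175m_600.crf.ser.gz
parse.model = edu/stanford/nlp/models/lexparser/germanFactored.ser.gz
```

You would then pass this file to the pipeline with -props yourfile.prop on the command line.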

Please let me know if it's still not working!

