Stanford POS tagger in Java usage

Question

Mar 9, 2011 1:22:06 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: � (U+FFFD, decimal: 65533)
Mar 9, 2011 1:22:06 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: � (U+FFFD, decimal: 65533)
Mar 9, 2011 1:22:06 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: � (U+FFFD, decimal: 65533)
Mar 9, 2011 1:22:06 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: � (U+FFFD, decimal: 65533)
Mar 9, 2011 1:22:06 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: � (U+FFFD, decimal: 65533)
Mar 9, 2011 1:22:06 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: � (U+FFFD, decimal: 65533)
Mar 9, 2011 1:22:06 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: � (U+FFFD, decimal: 65533)

These are the errors I get when I try to assign POS tags to sentences that I read from a file. The first few sentences are processed without this error (i.e. Untokenizable), but it starts appearing after some sentences have been read. I use v2.0 (i.e. 2009) of the POS tagger with the left3words model.

Answer 1

I agree with Yuval: this is a character encoding problem. The most common cause, though, is a file in a single-byte encoding such as ISO-8859-1 that the tagger is trying to read as UTF-8. See the discussion of U+FFFD on Wikipedia.
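A minimal sketch of how this mismatch produces U+FFFD, using only the standard library. The class and method names are made up for illustration, and the sample text is an assumption, not taken from the asker's file.

```java
import java.nio.charset.StandardCharsets;

class ReplacementCharDemo {
    // Count U+FFFD replacement characters in an already-decoded string.
    static long countReplacementChars(String text) {
        return text.chars().filter(c -> c == 0xFFFD).count();
    }

    public static void main(String[] args) {
        // "naïve text" stored in ISO-8859-1: the ï is the single byte 0xEF.
        String original = "na\u00efve text";
        byte[] latin1Bytes = original.getBytes(StandardCharsets.ISO_8859_1);

        // Decoding with the wrong charset: 0xEF is not a valid UTF-8 sequence
        // here, so the decoder substitutes U+FFFD for it.
        String wrong = new String(latin1Bytes, StandardCharsets.UTF_8);

        // Decoding with the charset the bytes were written in round-trips cleanly.
        String right = new String(latin1Bytes, StandardCharsets.ISO_8859_1);

        System.out.println(countReplacementChars(wrong)); // 1
        System.out.println(right.equals(original));       // true
    }
}
```

If the input file really is ISO-8859-1, opening it with that charset explicitly (for example via `Files.newBufferedReader(path, StandardCharsets.ISO_8859_1)`) before handing the text to the tagger should avoid the substitution.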

Answer 2

This looks like an encoding problem to me. Can you post the offending sentence? I couldn't find this in the documentation, but I would try checking if the file is in UTF-8 encoding.
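One way to check whether a file is valid UTF-8 is to decode it with a strict decoder that reports malformed input instead of silently replacing it. This is a sketch under that assumption; the class name `Utf8Check` and the command-line handling are illustrative.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class Utf8Check {
    // Returns true only if the bytes decode as UTF-8 without any malformed sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java Utf8Check <file>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(isValidUtf8(bytes) ? "valid UTF-8" : "not valid UTF-8");
    }
}
```

A file that passes this check can still contain literal U+FFFD characters if it was damaged by an earlier conversion, so a passing result does not rule out an encoding problem upstream.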

Answer 3

I ran into this issue as well. One way to test for characters like this is Character.isIdentifierIgnorable(). In my case, the untokenizable characters returned true and the tokenizable ones returned false. Note that this method only flags non-whitespace control characters and format characters, so it returns false for U+FFFD itself.
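A small sketch for seeing what Character.isIdentifierIgnorable() reports for specific code points and for stripping the ones it flags. The class and helper names are hypothetical. Whether the tagger's tokenizer rejects exactly this set of characters is an assumption, not something the method guarantees.

```java
class IgnorableCheck {
    // Remove every code point that Character.isIdentifierIgnorable() flags.
    static String stripIgnorable(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints()
            .filter(cp -> !Character.isIdentifierIgnorable(cp))
            .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(Character.isIdentifierIgnorable(0x0007)); // BEL, a control character: true
        System.out.println(Character.isIdentifierIgnorable(0x00AD)); // soft hyphen, a format character: true
        System.out.println(Character.isIdentifierIgnorable('a'));    // ordinary letter: false
        System.out.println(Character.isIdentifierIgnorable(0xFFFD)); // replacement character, a symbol: false

        // Control and format characters are dropped; U+FFFD is left in place.
        System.out.println(stripIgnorable("a\u0007b\u00ADc\uFFFD").equals("abc\uFFFD")); // true
    }
}
```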

Answer 4

If you are reading content from DOC or PDF (Portable Document Format) files, use Apache Tika to extract the text. It might help.

Apache Tika

About Tika

Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. It is written in Java, but includes a command-line version for use from other languages.

More information on Tika, including the bug tracker, mailing lists, and downloads, is available at http://tika.apache.org/

4 answers

solution1
8 ACCEPTED 2011-03-10 04:39:13

solution2
2 2011-03-09 09:06:54

solution3
1 2014-07-11 21:55:24

solution4
0 2013-08-01 06:49:06
