
Stanford POS tagger in Java usage

Mar 9, 2011 1:22:06 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: � (U+FFFD, decimal: 65533)
Mar 9, 2011 1:22:06 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: � (U+FFFD, decimal: 65533)
Mar 9, 2011 1:22:06 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: � (U+FFFD, decimal: 65533)
Mar 9, 2011 1:22:06 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: � (U+FFFD, decimal: 65533)
Mar 9, 2011 1:22:06 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: � (U+FFFD, decimal: 65533)
Mar 9, 2011 1:22:06 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: � (U+FFFD, decimal: 65533)
Mar 9, 2011 1:22:06 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: � (U+FFFD, decimal: 65533)

These are the errors I get when I try to assign POS tags to sentences. I read the sentences from a file. For the first few sentences I don't get this error (i.e. untokenizable), but it starts appearing after some sentences have been read. I use v2.0 (i.e. 2009) of the POS tagger, and the model is left3words.

I agree with Yuval: this is a character encoding problem, but the most common case is actually when the file is in a single-byte encoding such as ISO-8859-1 while the tagger is trying to read it as UTF-8. See the discussion of U+FFFD on Wikipedia.
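To illustrate the mismatch described here, the following is a minimal sketch that uses only the JDK, not the tagger. The class name EncodingMismatchDemo and the helper names are made up for this example. It shows that bytes written in ISO-8859-1 and decoded as UTF-8 produce U+FFFD, and that passing the charset explicitly when reading the file avoids this.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical demo class; not part of the Stanford tagger.
final class EncodingMismatchDemo {

    // Decodes raw bytes with the given charset. The String constructor
    // silently replaces malformed input with U+FFFD.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Reads all lines of a file using an explicit charset instead of
    // relying on a default.
    static List<String> readSentences(Path file, Charset charset) throws IOException {
        return Files.readAllLines(file, charset);
    }

    public static void main(String[] args) {
        // "caf\u00e9" stored as ISO-8859-1: the e-acute is the single byte 0xE9.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        String wrong = decode(latin1, StandardCharsets.UTF_8);      // 0xE9 is malformed UTF-8 here
        String right = decode(latin1, StandardCharsets.ISO_8859_1); // decoded as intended

        System.out.println("decoded as UTF-8 contains U+FFFD: " + (wrong.indexOf('\uFFFD') >= 0));
        System.out.println("decoded as ISO-8859-1 round-trips: " + right.equals("caf\u00e9"));
    }
}
```

If the input file really is ISO-8859-1, reading it with readSentences(path, StandardCharsets.ISO_8859_1) and passing the resulting strings to the tagger should avoid the warning. Whether the tagger itself accepts an encoding option depends on the version, so check its documentation.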

This looks like an encoding problem to me. Can you post the offending sentence? I couldn't find this in the documentation, but I would try checking if the file is in UTF-8 encoding.
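One way to check whether the file is valid UTF-8 is to decode it strictly and see whether the decoder complains. This is a sketch using the JDK's CharsetDecoder; the class name Utf8Check is made up for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of the Stanford tagger.
final class Utf8Check {

    // Returns true if the bytes decode as UTF-8 without any malformed
    // or unmappable sequences, false otherwise.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)       // throw instead of inserting U+FFFD
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

For a file, something like Utf8Check.isValidUtf8(Files.readAllBytes(path)) would do. Note that a file which already contains U+FFFD encoded as UTF-8 passes this check, since in that case the damage happened before the file was written.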

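I ran into this issue as well. One way to test whether a character is tokenizable is to pass it to Character.isIdentifierIgnorable(). In my case the untokenizable characters returned true, while tokenizable characters returned false. Note that this method covers non-whitespace ISO control characters and format characters; U+FFFD itself is a symbol character and returns false, so it needs a separate check.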
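A rough sketch of such a pre-check, scanning a sentence before it is handed to the tagger. The class and method names are made up for this example. It also tests for U+FFFD explicitly, since that is the character reported in the log above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of the Stanford tagger.
final class SuspiciousCharScan {

    // Returns the char offsets of code points that are either the
    // replacement character U+FFFD or ignorable per
    // Character.isIdentifierIgnorable (control and format characters).
    static List<Integer> suspiciousOffsets(String text) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp == 0xFFFD || Character.isIdentifierIgnorable(cp)) {
                offsets.add(i);
            }
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return offsets;
    }
}
```

A non-empty result for a sentence suggests the input was decoded with the wrong charset or contains stray control characters, and points to where in the sentence to look.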

If you are reading content from DOC or PDF (Portable Document Format) files, use Apache Tika to extract the text. It might help.

Apache Tika

About tika

Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. It is written in Java, but includes a command line version for use from other languages.

More information on Tika, the bug tracker, mailing lists, downloads and more are available at http://tika.apache.org/


 