EDITED: I've updated my traceback below
I know this kind of problem has been asked many times, but I have been struggling with this issue for 2 days and still can't figure out a solution.
Here's the case: I'm using pycrfsuite (a Python implementation of CRF), and this snippet raises UnicodeEncodeError.
import pycrfsuite

trainer = pycrfsuite.Trainer(verbose=True)
for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)
Error ...
Traceback (most recent call last):
File "/home/enamoria/Dropbox/NLP/POS-tagger/MyTagger/V2_CRF/src/pos-tag/pos-tag.py", line 46, in <module>
trainer.append(xseq, yseq)
File "pycrfsuite/_pycrfsuite.pyx", line 312, in pycrfsuite._pycrfsuite.BaseTrainer.append
File "stringsource", line 48, in vector.from_py.__pyx_convert_vector_from_py_std_3a__3a_string
File "stringsource", line 15, in string.from_py.__pyx_convert_string_from_py_std__in_string
UnicodeEncodeError: 'ascii' codec can't encode character '\u201d' in position 0: ordinal not in range(128)
\u201d is the closing double quote ” (Unicode code point U+201D). This exception was also raised for \u201c (the opening double quote “) and \u2026 (the ellipsis …, IIRC).
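For reference, the same exception can be reproduced without pycrfsuite. The sketch below only assumes that something in the extension layer performs the equivalent of an implicit ASCII encode, which is what the traceback suggests; it does not call the library. The helper name `ascii_encodable` is made up for illustration.

```python
import unicodedata

# The three characters mentioned above.
chars = ["\u201d", "\u201c", "\u2026"]


def ascii_encodable(s):
    """Return True if s can be encoded with the 'ascii' codec (hypothetical helper)."""
    try:
        s.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


# Official Unicode names of the offending characters.
names = [unicodedata.name(c) for c in chars]

# The same characters encode without error as UTF-8.
utf8 = [c.encode("utf-8") for c in chars]

for c, name, raw in zip(chars, names, utf8):
    print(repr(c), name, ascii_encodable(c), raw)
```

None of the three characters is ASCII-encodable, but all of them have a valid UTF-8 byte representation.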
FYI, X_train and y_train are a feature representation of a text and its corresponding labels, which I read from a file. I've tried using encoding='utf8', errors='ignore', but the error is still there:
for file in filelist:
    with open(self.datapath + "/" + file, "r", encoding='utf8', errors='ignore') as f:
        raw_text = [(line.strip("\n").strip(" ").replace(" ", " ").replace(" ", " ")).split(" ") for line in f.readlines() if line != '\n']
    data.extend(raw_text)
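A note on why errors='ignore' does not help here: that argument only affects how invalid bytes are handled while decoding the file. Characters such as ” are valid UTF-8, so they are decoded normally and stay in the resulting strings. A minimal sketch, using a temporary file and a made-up sample sentence:

```python
import os
import tempfile

# Made-up sample text containing the characters from the traceback.
text = "He said \u201chello\u201d\u2026\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf8") as f:
        f.write(text)
    # Same open() arguments as in the reading loop above.
    with open(path, "r", encoding="utf8", errors="ignore") as f:
        read_back = f.read()

# The non-ASCII characters survive the read untouched.
non_ascii = [c for c in read_back if ord(c) > 127]
print(non_ascii)
```

So the strings handed to the trainer still contain non-ASCII characters, and the encoding problem has to be dealt with at the point where the features are passed to the library.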
My question is: does pycrfsuite only support ASCII encoding? If so, is there any workaround available? My data is Vietnamese, which ASCII can't represent, and switching to a new CRF library is the last thing I want to do.
Thanks in advance.
The pycrfsuite docs don't mention what their Unicode support is for feature values and keys. I can't tell from the examples either, as it isn't clear to me if they are Python 2 or 3. Also, I don't know enough about Cython to give you a definite answer by reading the source.
In any case, I suggest you try two things:
Just encode the keys yourself before you pass them to the library. If the values are strings too, encode them as well. Maybe the library is happy to accept bytes objects.
If that doesn't work (because it really wants to have ASCII), use some ASCII encoding, e.g. URL-encode the string (urllib.parse.quote) or call Python's built-in ascii() function on it. The latter will encode 'can’t' to "'can\\u2019t'", with backslash escapes and quotes. It doesn't really matter, since the classifier doesn't care about how the feature keys look, as long as the same input produces the same feature key.
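Both suggestions can be sketched roughly as follows. The helper names (`encode_item`, `ascii_key`, `url_key`) are made up for illustration, and the sketch assumes each item is a list of feature strings; whether pycrfsuite accepts the resulting bytes is exactly the thing to test, so the library itself is not called here.

```python
from urllib.parse import quote


def encode_item(features):
    """Option 1: turn every str feature into UTF-8 bytes (bytes pass through)."""
    return [f.encode("utf-8") if isinstance(f, str) else f for f in features]


def ascii_key(feature):
    """Option 2a: ASCII-only key via ascii(); adds quotes and backslash escapes."""
    return ascii(feature)


def url_key(feature):
    """Option 2b: ASCII-only key via percent-encoding of the UTF-8 bytes."""
    return quote(feature, safe="")


item = ["word=\u201d", "nxtletter=<\u00ed"]
print(encode_item(item))
print([ascii_key(f) for f in item])
print([url_key(f) for f in item])
```

All three mappings are deterministic, so the same input always yields the same key, which is all the classifier needs. Keep in mind that whatever transformation is applied at training time has to be applied at tagging time as well.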
I hope this helps!
Before the for loop, you can call the encode('utf-8') method on each string element of xseq and yseq.
One element of my xseq that was causing problems now looks like this: [b'nxtletter=<\\xc3\\xad']
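As a quick sanity check, the bytes \xc3\xad are the UTF-8 encoding of í. The sketch below assumes the doubled backslashes shown above are only display escaping and that the original feature string was 'nxtletter=<í'.

```python
# Assumed original feature string (reconstructed from the bytes shown above).
feature = "nxtletter=<\u00ed"

encoded = feature.encode("utf-8")
decoded = encoded.decode("utf-8")

print(encoded)
print(decoded)
```

Decoding with the same codec recovers the original text, so nothing is lost by encoding the features before passing them to the trainer.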
This is my code
def sent2features(data):
    return [extractFeatures(sent) for sent in data]

def sent2labels(data):
    return [extractLabels(sent) for sent in data]

X_train = sent2features(train_data)
Y_train = sent2labels(train_data)

for xseq, yseq in zip(X_train, Y_train):
    trainer.append(xseq, yseq)
The encoding lines in the extractFeatures and extractLabels functions look like this:
def extractFeatures(sent):
    feature_list = []
    for word in sent:
        word_len = len(word)
        for letter in word:
            # ...
            # Here I define my features list
            # ...
            feature_list.append([f.encode('utf-8') for f in features])  # encode each feature for pycrfsuite
    return feature_list

def extractLabels(sent):
    labels = []
    for word in sent:
        for letter in word:
            labels.append(letter[2].encode('utf-8'))  # encode each label for pycrfsuite
    return labels
Maybe this works for you. Good luck!