UnicodeEncodeError - utf-8 encoding in python-crfsuite (pycrfsuite)

EDITED: I've updated my traceback below

I know this kind of problem has been asked about many times, but I have been struggling with this issue for two days and still can't figure out a solution.
Here is the case: I'm using pycrfsuite (a Python binding to the CRFsuite CRF library), and this snippet raises UnicodeEncodeError.

import pycrfsuite

trainer = pycrfsuite.Trainer(verbose=True)
for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

Error ...

Traceback (most recent call last):  
File "/home/enamoria/Dropbox/NLP/POS-tagger/MyTagger/V2_CRF/src/pos-tag/pos-tag.py", line 46, in <module>
     trainer.append(xseq, yseq)
File "pycrfsuite/_pycrfsuite.pyx", line 312, in pycrfsuite._pycrfsuite.BaseTrainer.append
File "stringsource", line 48, in vector.from_py.__pyx_convert_vector_from_py_std_3a__3a_string
File "stringsource", line 15, in string.from_py.__pyx_convert_string_from_py_std__in_string
UnicodeEncodeError: 'ascii' codec can't encode character '\u201d' in position 0: ordinal not in range(128)

\u201d is the closing double quote (”) as a Unicode escape. This exception was also raised for \u201c (opening double quote) and \u2026 (ellipsis, IIRC).
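The error message itself can be reproduced without pycrfsuite. The sketch below only shows that the character from the traceback cannot be represented by the 'ascii' codec while UTF-8 handles it; it does not claim to show what pycrfsuite does internally.

```python
# U+201D is the right double quotation mark named in the traceback.
quote = "\u201d"

try:
    quote.encode("ascii")
except UnicodeEncodeError as exc:
    # Same wording as the last line of the traceback.
    message = str(exc)
    print(message)

# The same character encodes fine as UTF-8 (three bytes).
utf8_bytes = quote.encode("utf-8")
print(utf8_bytes)  # b'\xe2\x80\x9d'
```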

FYI, X_train and y_train are the feature representation of a text and its corresponding labels, which I read from a file. I've tried using encoding='utf8', errors='ignore', but the error is still there.

for file in filelist:
    with open(self.datapath + "/" + file, "r", encoding='utf8', errors='ignore') as f:
        raw_text = [(line.strip("\n").strip(" ").replace("   ", " ").replace("  ", " ")).split(" ") for line in f.readlines() if line != '\n']
        data.extend(raw_text)
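A likely reason the read-side options make no difference: errors='ignore' only drops bytes that fail to decode, and valid UTF-8 text decodes cleanly into str objects that still contain the non-ASCII characters. The following is a minimal sketch of that behaviour; the sample line and the temporary file are made up for illustration.

```python
import os
import tempfile

# Hypothetical sample line: Vietnamese text in typographic quotes, plus an ellipsis.
sample = "\u201cXin ch\u00e0o\u201d \u2026\n"

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.txt")
    with open(path, "w", encoding="utf8") as f:
        f.write(sample)
    # Same read options as in the question.
    with open(path, "r", encoding="utf8", errors="ignore") as f:
        tokens = [line.strip().split(" ") for line in f if line != "\n"]

# Every non-ASCII character survives the read, so it reaches the trainer later.
non_ascii = [ch for token in tokens[0] for ch in token if ord(ch) > 127]
print(tokens)
print(non_ascii)
```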

My question is: does pycrfsuite only support ASCII encoding? If so, is there any workaround available? My data is Vietnamese, which ASCII can't represent, and switching to a new CRF library is the last thing I want.

Thanks in advance.

The pycrfsuite docs don't mention what their Unicode support is for feature values and keys. I can't tell from the examples either, as it isn't clear to me if they are Python 2 or 3. Also, I don't know enough about Cython to give you a definite answer by reading the source.

In any case, I suggest you try two things:

  1. Just encode the keys yourself before you pass them to the library. If the values are strings too, encode them as well. Maybe the library is happy to accept bytes objects.

  2. If that doesn't work (because it really wants ASCII), use some ASCII-only encoding, e.g. URL-encode the string (urllib.parse.quote) or call Python's built-in ascii() function on it. The latter turns 'can’t' (with a typographic apostrophe) into the string 'can\u2019t', including the backslash escape and the surrounding quotes. That doesn't really matter, since the classifier doesn't care what the feature keys look like, as long as the same input always produces the same feature key.
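The second suggestion can be sketched as follows. Both helper names (ascii_key, url_key) and the sample feature are hypothetical; the only point is that each function maps a given input to one fixed ASCII-only string.

```python
from urllib.parse import quote


def ascii_key(feature: str) -> str:
    """Hypothetical helper: escape non-ASCII characters using ascii()."""
    return ascii(feature)


def url_key(feature: str) -> str:
    """Hypothetical helper: percent-encode the UTF-8 bytes, keeping '=' readable."""
    return quote(feature, safe="=")


feature = "word=\u201d"  # made-up feature containing the closing double quote

print(ascii_key(feature))  # 'word=\u201d'  (quotes and backslash escape included)
print(url_key(feature))    # word=%E2%80%9D
```

Either form is deterministic, so the same feature yields the same key at training and at tagging time, as long as the same helper is applied in both places.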

I hope this helps!

Before the for loop, you can call encode('utf-8') on each string element of xseq and yseq.

One element of my xseq that was causing me problems now looks like this: [b'nxtletter=<\xc3\xad']
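As a small illustration of where such a value comes from, the sketch below encodes a flat list of feature strings to UTF-8 bytes. The helper name encode_sequence is hypothetical, and whether pycrfsuite accepts bytes in a given setup is something to verify rather than assume.

```python
def encode_sequence(seq, encoding="utf-8"):
    """Hypothetical helper: encode each str in a flat list, leaving bytes untouched."""
    return [item.encode(encoding) if isinstance(item, str) else item for item in seq]


# A feature ending in 'í' (U+00ED), matching the element shown above.
xseq_item = ["nxtletter=<\u00ed"]

encoded = encode_sequence(xseq_item)
print(encoded)  # [b'nxtletter=<\xc3\xad']

# Decoding restores the original text, so no information is lost.
decoded = [b.decode("utf-8") for b in encoded]
```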

This is my code

def sent2features(data):
    return [extractFeatures(sent) for sent in data]

def sent2labels(data):
    return [extractLabels(sent) for sent in data]

X_train = sent2features(train_data)
Y_train = sent2labels(train_data)

for xseq, yseq in zip(X_train, Y_train):
    trainer.append(xseq, yseq)

The encoding lines in the extractFeatures and extractLabels functions look like this:

def extractFeatures(sent):
    feature_list = []
    for word in sent:
        word_len = len(word)
        for letter in word:
            # ...
            # Here I define my features list (`features`)
            # ...
            feature_list.append([f.encode('utf-8') for f in features])  # encode for pycrfsuite
    return feature_list

def extractLabels(sent):
    labels = []
    for word in sent:
        for letter in word:
            labels.append(letter[2].encode('utf-8'))  # encode for pycrfsuite
    return labels

Maybe this works for you. Good luck!
