
Python mmh3: UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-14: ordinal not in range(128)

I'm querying a DB for jokes and am getting back Python str objects. I want to use them as unicode objects, so I do:

joke = unicode(joke, 'utf-8')

This works for all my DB results and does not cause any issues.

Then I try to hash each word in each joke like this:

result = mmh3.hash(joke)

and I get back:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-14: ordinal not in range(128)

I inspected the text and it's Japanese. Does this mean I should drop all non-ASCII characters before hashing, or is there a better way to handle this?

Thanks!

The .hash(...) function appears to require either bytes or ASCII-convertible text.
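The failure can be reproduced without mmh3 at all. The sketch below assumes that the library implicitly converts text with the ASCII codec, which is what the traceback suggests; the sample string and the can_encode helper are hypothetical, for illustration only.

```python
# -*- coding: utf-8 -*-

# Hypothetical sample standing in for one of the Japanese jokes.
text = u"こんにちは"


def can_encode(value, encoding):
    """Return True if value can be encoded with the given codec."""
    try:
        value.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# ASCII only covers code points 0-127, so Japanese text cannot be encoded.
ascii_ok = can_encode(text, "ascii")

# UTF-8 can represent every Unicode code point.
utf8_ok = can_encode(text, "utf-8")

print(ascii_ok, utf8_ok)
```

Encoding with ASCII fails for this text, while encoding with UTF-8 succeeds, so nothing has to be dropped from the jokes.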

The easiest way (if you're dealing entirely with unicode objects) is to convert them to bytes as you call mmh3.hash:

result = mmh3.hash(joke.encode('UTF-8'))
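Since the question mentions hashing each word, here is a minimal sketch of a per-word version. The helper names to_utf8_bytes and hash_words are hypothetical, and zlib.crc32 is used only as a standard-library stand-in so the snippet runs without mmh3 installed; it is not the same hash as MurmurHash3.

```python
# -*- coding: utf-8 -*-
import zlib


def to_utf8_bytes(text):
    """Encode text as UTF-8; values that are already bytes pass through."""
    if isinstance(text, bytes):
        return text
    return text.encode("utf-8")


def hash_words(joke, hash_func=zlib.crc32):
    """Hash each whitespace-separated word of joke after encoding it.

    hash_func is any callable that accepts bytes. zlib.crc32 is only a
    stand-in here; pass mmh3.hash to get MurmurHash3 values.
    """
    return [hash_func(to_utf8_bytes(word)) for word in joke.split()]


# Hypothetical sample joke with three space-separated words.
joke = u"猫 が 好き"
hashes = hash_words(joke)
print(len(hashes))
```

With mmh3 installed, calling hash_words(joke, hash_func=mmh3.hash) should hash every word without raising UnicodeEncodeError, because each word reaches the hash function as UTF-8 bytes.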
