I'm querying a DB for jokes and am getting back Python str objects. I want to use them as unicode objects, so I do:
joke = unicode(joke, 'utf-8')
This works for all my DB results and does not cause any issues.
Then I try to hash each word in each joke like this:
result = mmh3.hash(joke)
and I get back:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-14: ordinal not in range(128)
I inspected the text and it's Japanese. Does this mean I should drop all non-ASCII characters before hashing, or is there a better way to handle this?
Thanks!
The .hash(...) function appears to require either bytes or ASCII-convertible text.
The easiest way (if you're dealing entirely with unicode objects) is to encode them to bytes as you call mmh3.hash:
result = mmh3.hash(joke.encode('UTF-8'))
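Since the question mentions hashing each word, here is a minimal sketch of the same fix applied per word. It rests on some assumptions: hash_words is a hypothetical helper, the sample text is made up, and zlib.crc32 from the standard library stands in for mmh3.hash so the sketch runs without the third-party package. With mmh3 installed, the call inside the loop would be mmh3.hash(word.encode('utf-8')), as in the line above.

```python
# -*- coding: utf-8 -*-
import zlib


def hash_words(joke, encoding="utf-8"):
    """Hash each whitespace-separated word of a text (unicode) string.

    Assumption: zlib.crc32 is only a stand-in for mmh3.hash here. Like
    mmh3.hash, it is fed bytes, so each word is encoded explicitly
    instead of relying on an implicit ASCII conversion.
    """
    return [zlib.crc32(word.encode(encoding)) for word in joke.split()]


# Hypothetical sample text containing non-ASCII characters.
joke = u"猫 が 笑う"

# An ASCII conversion of this text fails, which is the kind of
# conversion behind the UnicodeEncodeError in the question.
try:
    joke.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding to UTF-8 first works for any text, so nothing has to be dropped.
hashes = hash_words(joke)
```

Encoding explicitly keeps every character, so Japanese jokes hash just like English ones. The one thing to keep consistent is the encoding: the same text encoded with a different codec produces different bytes and therefore a different hash.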