Python mmh3: UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-14: ordinal not in range(128)

Question

I'm querying a DB for jokes and am getting back Python str objects. I want to use them as unicode objects, so I do:

joke = unicode(joke, 'utf-8')

This works for all my DB results and does not cause any issues.

Then I try to hash each word in each joke like this:

result = mmh3.hash(joke)

and I get back:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-14: ordinal not in range(128)

I inspected the text and it's Japanese. Does this mean I should drop all non-ASCII characters before hashing, or is there a better way to handle this?

Thanks!

Answer 1

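The .hash(...) function appears to require either bytes or ASCII-convertible text. When it is given a unicode object, the text is implicitly encoded with the 'ascii' codec, which is why Japanese characters raise UnicodeEncodeError. There is no need to drop them.

The easiest way (if you're dealing entirely with unicode objects) is to encode them to bytes yourself as you call mmh3.hash: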

result = mmh3.hash(joke.encode('UTF-8'))
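
To illustrate the difference between the two encodings, here is a minimal sketch that uses only the standard library. The sample text and the helper name to_utf8_bytes are assumptions made for the example and do not come from the original code. The mmh3 call itself is left as a comment because it needs the third-party package installed.

```python
# Sample Japanese text, written with escapes so the source file stays pure ASCII.
joke = u"\u3053\u3093\u306b\u3061\u306f"


def to_utf8_bytes(text):
    """Hypothetical helper: return UTF-8 bytes for text; pass bytes through unchanged."""
    if isinstance(text, bytes):
        return text
    return text.encode("utf-8")


# Encoding with the 'ascii' codec fails for this text, which mirrors the
# implicit conversion behind the reported error.
try:
    joke.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding explicitly as UTF-8 succeeds and produces bytes that can be hashed.
payload = to_utf8_bytes(joke)

# With mmh3 installed, the call from the answer would then be:
# result = mmh3.hash(payload)
```

Encoding explicitly keeps every character, so jokes that differ only in their non-ASCII text still produce different input bytes for the hash.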

solution1
4 ACCEPTED 2018-08-25 23:31:02