I'm using Python 2 on Spark (PySpark and Pandas) to analyze data about emoji usage. I have a string like u'u+1f375' or u'u+1f618' that I want to convert to 🍵 and 😘 respectively.
I've read several other SO posts and the Unicode HOWTO, trying to grasp encode and decode, to no avail.
This didn't work:
decode_udf = udf(lambda x: x.decode('unicode-escape'))
foo = emojis.withColumn('decoded_emoji', decode_udf(emojis.emoji))
Result: decoded_emoji=u'u+1f618'
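A small sketch of why that attempt returns the input unchanged: the unicode-escape codec only interprets backslash escapes, and 'u+1f618' contains no backslash. The variable names are illustrative, and byte literals are used only so the snippet behaves the same on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# 'unicode-escape' only acts on backslash escapes such as \U0001f618.
raw = b'u+1f618'           # the stored text: no backslash anywhere
escaped = b'\\U0001f618'   # a real escape: backslash, U, eight hex digits

unchanged = raw.decode('unicode-escape')    # nothing to interpret
decoded = escaped.decode('unicode-escape')  # becomes one emoji character

print(unchanged == u'u+1f618')     # True: the text passes through as-is
print(decoded == u'\U0001f618')    # True: the escape was interpreted
```

So the string has to be turned into a \UXXXXXXXX escape first, and only then decoded.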
This ended up working on a one-off basis, but fails the moment I apply it to my RDD.
def rename_if_emoji(pattern):
    """rename the element name of dataframe with emoji"""
    if pattern.lower().startswith("u+"):
        emoji_string = ""
        EMOJI_PREFIX = "u+"
        for part_org in pattern.lower().split(" "):
            part = part_org.strip()
            if part.startswith(EMOJI_PREFIX):
                padding = "0" * (8 + len(EMOJI_PREFIX) - len(part))
                codepoint = '\U' + padding + part[len(EMOJI_PREFIX):]
                print("codepoint: " + codepoint)
                emoji_string += codepoint.decode('unicode-escape')
                print("emoji_string: " + emoji_string)
        return emoji_string
    else:
        return pattern
rename_if_emoji_udf = udf(rename_if_emoji)
Error: UnicodeEncodeError: 'ascii' codec can't encode character u'\\U0001f618' in position 14: ordinal not in range(128)
Whether emoji print correctly depends on the IDE or terminal you use. Python 2's print encodes Unicode strings to the output device's encoding, so on a device whose encoding cannot represent the character you get a UnicodeEncodeError. You also need font support. Your error comes from the print, not from the decoding: position 14 in the message is the first character after the 14-character prefix "emoji_string: ". You've decoded the string correctly, but your output device should ideally support UTF-8.
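As a rough illustration of that point, the sketch below reproduces the failure without printing anything: encoding the already-decoded emoji to ASCII is what raises, and encoding to UTF-8 or to an escaped form does not. The variable names are made up for the example, and it assumes nothing about how your Spark workers are configured.

```python
# -*- coding: utf-8 -*-
# The emoji below is already a correctly decoded Unicode string.
emoji = b'\\U0001f618'.decode('unicode-escape')

# An ASCII-only output device forces this encoding step, which fails.
try:
    emoji.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Two encodings that always succeed, whatever the terminal supports:
utf8_bytes = emoji.encode('utf-8')                 # real UTF-8 bytes
escaped_bytes = emoji.encode('unicode-escape')     # ASCII-only escape text

print(ascii_ok)   # False
```

Logging the escaped form (or the repr) instead of the raw string is one way to keep debug output from raising on a device that only handles ASCII.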
The example below simplifies the decoding process. I also print the repr() of the string in case the terminal isn't configured to support the characters being printed.
import re

def replacement(m):
    '''Assume the matched characters are hexadecimal, convert to integer,
       format appropriately, and decode back to Unicode.
    '''
    i = int(m.group(1), 16)
    return '\\U{:08X}'.format(i).decode('unicode-escape')

def replace(s):
    '''Replace all u+nnnn strings with the Unicode equivalent.
    '''
    return re.sub(ur'u\+([0-9a-fA-F]+)', replacement, s)

s = u'u+1f618 u+1f375'
t = replace(s)
print repr(t)
print t
Output (on a UTF-8 IDE):
u'\U0001f618 \U0001f375'
😘 🍵
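To apply the same idea to a DataFrame column, one possible sketch is below. It is not tested against your cluster. decode_emoji and add_decoded_column are hypothetical names, the column names 'emoji' and 'decoded_emoji' are taken from the question, and the function deliberately prints nothing so that no encoding to the worker's stdout takes place. The extra encode('ascii') is only there so the same code runs on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
import re

# Matches tokens such as u+1f618 and captures the hexadecimal digits.
_CODEPOINT = re.compile(u'u\\+([0-9a-fA-F]+)')

def _to_char(match):
    # Build a \UXXXXXXXX escape and decode it. This avoids unichr(), which
    # rejects code points above U+FFFF on narrow Python 2 builds.
    escape = '\\U%08x' % int(match.group(1), 16)
    return escape.encode('ascii').decode('unicode-escape')

def decode_emoji(text):
    """Replace every u+nnnn token in text with the character it names."""
    if text is None:          # Spark passes nulls through as None
        return None
    return _CODEPOINT.sub(_to_char, text)

def add_decoded_column(df, source='emoji', target='decoded_emoji'):
    """Hypothetical helper: wrap decode_emoji in a UDF and add a column."""
    # Imported here so decode_emoji can be tried without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType
    decode_udf = udf(decode_emoji, StringType())
    return df.withColumn(target, decode_udf(df[source]))
```

With the DataFrame from the question, the call would be something like foo = add_decoded_column(emojis). If you inspect the result in a console that isn't UTF-8, look at the repr() of the values rather than printing them directly.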