
Convert unicode string representation of emoji to unicode emoji in python

I'm using Python 2 on Spark (PySpark and Pandas) to analyze data about emoji usage. I have strings like u'u+1f375' or u'u+1f618' that I want to convert to 🍵 and 😘 respectively.

I've read several other SO posts and the Unicode HOWTO, trying to grasp encode and decode, to no avail.

This didn't work:

decode_udf = udf(lambda x: x.decode('unicode-escape'))
foo = emojis.withColumn('decoded_emoji', decode_udf(emojis.emoji))
Result: decoded_emoji=u'u+1f618'

This ended up working on a one-off basis, but fails the moment I apply it to my RDD.

def rename_if_emoji(pattern):
  """rename the element name of dataframe with emoji"""

  if pattern.lower().startswith("u+"):
    emoji_string = ""
    EMOJI_PREFIX = "u+"
    for part_org in pattern.lower().split(" "):
      part = part_org.strip();
      if (part.startswith(EMOJI_PREFIX)):
        padding = "0" * (8 + len(EMOJI_PREFIX) - len(part)) 
        codepoint = '\U' + padding + part[len(EMOJI_PREFIX):]
        print("codepoint: " + codepoint)
        emoji_string += codepoint.decode('unicode-escape')
        print("emoji_string: " + emoji_string)
    return emoji_string
  else:
    return pattern

rename_if_emoji_udf = udf(rename_if_emoji)

Error: UnicodeEncodeError: 'ascii' codec can't encode character u'\\U0001f618' in position 14: ordinal not in range(128)

Whether emoji print correctly depends on the IDE or terminal you use. Python 2's print encodes Unicode strings to the terminal's encoding, so you'll get a UnicodeEncodeError on a terminal whose encoding can't represent the character. You also need font support. Your error comes from the print calls, not from the decoding: you've decoded the string correctly, but your output device should ideally support UTF-8.
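To make the two points above concrete, here is a minimal sketch that separates the decoding step from the encoding step that print performs. The variable names are only illustrative, and the explicit encode('ascii') call is an assumption standing in for what print does when the terminal encoding is ASCII.

```python
import codecs

# 'unicode_escape' only interprets backslash escapes, so the raw 'u+1f618'
# form passes through unchanged, while '\U0001f618' becomes the emoji.
unchanged = codecs.decode(b'u+1f618', 'unicode_escape')
emoji = codecs.decode(b'\\U0001f618', 'unicode_escape')

# Encoding with the ASCII codec fails for the emoji, which is the same
# failure reported in the traceback from the question.
try:
    emoji.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding to UTF-8 succeeds, so a UTF-8 output device can display it.
utf8_bytes = emoji.encode('utf-8')
```

In other words, the decode in the question's function already produced the right character; only the attempt to write it to an ASCII-only device failed.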

The example simplifies the decoding process. I print the repr() of the string in case the terminal isn't configured to support the characters being printed.

import re

def replacement(m):
    '''Assume the matched characters are hexadecimal, convert to integer,
       format appropriately, and decode back to Unicode.
    '''
    i = int(m.group(1),16)
    return '\\U{:08X}'.format(i).decode('unicode-escape')

def replace(s):
    '''Replace all u+nnnn strings with the Unicode equivalent.
    '''
    return re.sub(ur'u\+([0-9a-fA-F]+)',replacement,s)

s = u'u+1f618 u+1f375'
t = replace(s)
print repr(t)
print t

Output (on a UTF-8 IDE):

u'\U0001f618 \U0001f375'
😘 🍵
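If you later move to Python 3, where str has no decode method and the ur'' prefix is gone, the same substitution could be sketched with chr() instead of a round trip through unicode-escape. This is an assumption-laden variant, not part of the answer above: the names CODEPOINT and replace_codepoints are hypothetical, and the pattern assumes tokens of the form u+ followed by one to six hexadecimal digits.

```python
import re

# Assumed token format: 'u+' followed by 1-6 hex digits, in either case.
CODEPOINT = re.compile(r'u\+([0-9a-f]{1,6})', re.IGNORECASE)

def replace_codepoints(s):
    """Replace every u+nnnn token in s with the character it names.

    chr() raises ValueError for values above 0x10FFFF, so malformed
    tokens are reported rather than silently passed through.
    """
    return CODEPOINT.sub(lambda m: chr(int(m.group(1), 16)), s)

t = replace_codepoints('u+1f618 u+1f375')
# ascii() plays the role that repr() plays in the Python 2 example:
# it shows the escaped form even if the terminal can't render emoji.
print(ascii(t))
```

Strings without any u+nnnn token are returned unchanged, so the function can be applied to a whole column without first filtering for emoji.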
