How to properly iterate over unicode characters in Python

Question

I would like to iterate over a string and output all emojis.

I'm trying to iterate over the characters and check them against an emoji list.

However, Python seems to split the Unicode characters into smaller ones, which breaks my code. Example:

>>> list(u'Test \U0001f60d')
[u'T', u'e', u's', u't', u' ', u'\ud83d', u'\ude0d']

Any ideas why u'\U0001f60d' gets split?

Or what's a better way to extract all emojis? This was my original extraction code:

def get_emojis(text):
  emojis = []
  for character in text:
    if character in EMOJI_SET:
      emojis.append(character)
  return emojis

Answer 1

Before version 3.3, Python stores Unicode internally as UTF-16 (narrow build) or UTF-32 (wide build) and, through a leaky abstraction, exposes this detail to the user. UTF-16 uses surrogate pairs to represent Unicode characters above U+FFFF as two code units, and a narrow build presents each code unit as a separate character. To fix the issue, either use a wide Python build or switch to Python 3.3 or later.
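To make the split concrete, here is a minimal sketch of the surrogate pair arithmetic defined by UTF-16. The function names are made up for illustration and are not part of any library; the point is only to show where u'\ud83d' and u'\ude0d' in the question come from.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high/low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a high/low surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F60D)
print(hex(high), hex(low))                  # 0xd83d 0xde0d
print(hex(from_surrogate_pair(high, low)))  # 0x1f60d
```

The two values printed first are exactly the two "characters" that the narrow build returned in the question.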

One way of dealing with a narrow build is to match the surrogate pairs:

Python 2.7 (narrow build):

>>> import re
>>> s = u'Test \U0001f60d'
>>> len(s)
7
>>> re.findall(u'(?:[\ud800-\udbff][\udc00-\udfff])|.',s)
[u'T', u'e', u's', u't', u' ', u'\U0001f60d']

Python 3.6:

>>> s = 'Test \U0001f60d'
>>> len(s)
6
>>> list(s)
['T', 'e', 's', 't', ' ', '😍']
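Putting the two cases together, the question's get_emojis could be rewritten roughly as follows. This is a sketch, not a tested drop-in: iter_characters is a hypothetical helper, EMOJI_SET is a one-element placeholder because the real set is not shown in the question, and the set is passed as a parameter only to keep the example self-contained.

```python
import re
import sys

# A surrogate pair (only ever present on a narrow build) or any single
# character. re.DOTALL lets "." match newlines as well.
_CHARACTER = re.compile(u'[\ud800-\udbff][\udc00-\udfff]|.', re.DOTALL)


def iter_characters(text):
    """Yield one string per code point, even on a narrow build."""
    for match in _CHARACTER.finditer(text):
        yield match.group(0)


def get_emojis(text, emoji_set):
    """Return every character of text that appears in emoji_set."""
    return [ch for ch in iter_characters(text) if ch in emoji_set]


EMOJI_SET = {u'\U0001f60d'}  # placeholder; the question's real set is not shown
print(sys.maxunicode > 0xFFFF)  # False only on a narrow build
print(get_emojis(u'Test \U0001f60d', EMOJI_SET) == [u'\U0001f60d'])
```

On Python 3.3 or later the surrogate branch of the pattern never matches a well-formed string, so a plain for loop over the text would behave the same way.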

Answer 2

Try this:

import re
re.findall(r'[^\w\s,]', my_list[0])

The regex r'[^\w\s,]' matches any character that is not a word character, whitespace, or a comma. Here my_list[0] is assumed to hold the string to search.
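A quick check of what that pattern returns, assuming Python 3 (where \w is Unicode-aware for str patterns). The wrapper function name is made up for the example. Note that the pattern keeps every non-word, non-space, non-comma character, so punctuation comes back alongside the emoji:

```python
import re


def non_word_characters(text):
    """Return every character that is not a word character, whitespace, or comma."""
    return re.findall(r'[^\w\s,]', text)


print(non_word_characters('Hi, you! \U0001f60d'))  # ['!', '😍']
```

If only emojis are wanted, the result would still need to be filtered against an emoji set.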

Answer 3

I've been fighting with Unicode myself, and it's not as easy as it seems. There's this emoji library that wraps all the caveats (I'm not affiliated).

If you want to list all emojis that appear in the string, I'd recommend emoji.emoji_lis.

Just look into the source of emoji.emoji_lis to understand how complicated it actually is.

Example

>>> import emoji
>>> emoji.emoji_lis('🥇🥈🇧🇹')
[{'location': 0, 'emoji': '🥇'}, {'location': 1, 'emoji': '🥈'}, {'location': 2, 'emoji': '🇧🇹'}]

Example with list (won't always work)

>>> list('🥇🥈🇧🇹')
['🥇', '🥈', '🇧', '🇹']
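The flag is split because it is not a single code point: it is made of two regional indicator symbols (U+1F1E6 to U+1F1FF). The sketch below, for Python 3 and using only the standard library, shows the code points and groups regional indicators in pairs. The function name is hypothetical, and the approach covers flags only; other multi-code-point emojis (skin tone modifiers, zero-width-joiner sequences) are not handled, which is why a dedicated library is the safer choice.

```python
import re

# Two regional indicator symbols (a flag) or any single code point.
_FLAG_OR_CHAR = re.compile('[\U0001F1E6-\U0001F1FF]{2}|.', re.DOTALL)


def split_keeping_flags(text):
    """Split text into code points, keeping regional indicator pairs together."""
    return _FLAG_OR_CHAR.findall(text)


text = '\U0001F947\U0001F948\U0001F1E7\U0001F1F9'  # the string '🥇🥈🇧🇹'
print([hex(ord(c)) for c in text])
# ['0x1f947', '0x1f948', '0x1f1e7', '0x1f1f9']
print(split_keeping_flags(text) ==
      ['\U0001F947', '\U0001F948', '\U0001F1E7\U0001F1F9'])  # True
```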

Answer 4

The problem is as described above. Possible ways to solve it are described here

3 answers

solution1
6 ACCEPTED 2017-10-12 16:41:28

solution2
0 2017-10-12 14:19:15

solution3
0 2022-01-05 13:50:01

solution4
-1 2017-10-14 21:11:26
