简体   繁体   中英

Match unicode emoji in python regex

I need to extract the text between a number and an emoticon in a text

example text:

blah xzuyguhbc ibcbb bqw 2 extract1  ☺️ jbjhcb 6 extract2 🙅 bjvcvvv

output:

extract1
extract2

The regex code that I wrote extracts the text between 2 numbers, I need to change the part where it identifies the unicode emoji characters and extracts text between them.

(?<=[\s][\d])(.*?)(?=[\d])

Please suggest a python friendly method, and I need it to work with all the emoji's not only the one's given in the example

https://regex101.com/r/uT1fM0/1

Since there are a lot of emoji with different unicode values , you have to explicitly specify them in your regex, or if they are with a spesific range you can use a character class. In this case your second simbol is not a standard emoji, it's just a unicode character, but since it's greater than \☺ (the unicode representation of ☺️) you can put it in a range with \☺ :

In [71]: s = 'blah xzuyguhbc ibcbb bqw 2 extract1  ☺️ jbjhcb 6 extract2 🙅 bjvcvvv'

In [72]: regex = re.compile(r'\d+(.*?)(?:\u263a|\U0001f645)')

In [74]: regex.findall(s)
Out[74]: [' extract1  ', ' extract2 ']

Or if you want to match more emojies you can use a character range (here is a good reference which shows you the proper range for different emojies http://apps.timwhitlock.info/emoji/tables/unicode ):

In [75]: regex = re.compile(r'\d+(.*?)[\u263a-\U0001f645]')

In [76]: regex.findall(s)
Out[76]: [' extract1  ', ' extract2 ']

Note that in second case you have to make sure that all the characters withn the aforementioned range are emojies that you want.

Here is another example:

In [77]: s = "blah 4 xzuyguhbc 😺 ibcbb bqw 2 extract1  ☺️ jbjhcb 6 extract2 🙅 bjvcvvv"

In [78]: regex = re.compile(r'\d+(.*?)[\u263a-\U0001f645]')

In [79]: regex.findall(s)
Out[79]: [' xzuyguhbc ', ' extract1  ', ' extract2 ']

Here's my stab at the solution. Not sure if it will work in all circumstances. The trick is to convert all unicode emojis into normal text. This could be done by following this post Then you can match the emoji just as any normal text. Note that it won't work if the literal strings \\u\u003c/code> or \\U is in your searched text.

Example: Copy your string into a file, let's call it emo . In terminal:

Chip chip@ 03:24:33@ ~: cat emo | python stackoverflow.py
blah xzuyguhbc ibcbb bqw 2 extract1  \u263a\ufe0f jbjhcb 6 extract2 \U0001f645 bjvcvvv\n
------------------------
[' extract1  ', ' extract2 ']

Where stackoverflow.py file is:

import fileinput
a = fileinput.input();
for line in a:
    teststring = unicode(line,'utf-8')
    teststring = teststring.encode('unicode-escape')

import re
print teststring
print "------------------------"
m = re.findall('(?<=[\s][\d])(.*?)(?=\\\\[uU])', teststring)
print m

So this may or not work depending on your needs. If you know the emoji's ahead of time though this will probably work, you just need a list of the types of emoticons to expect.

Anyway without more information, this is what I'd do.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

my_regex = re.compile(r'\d\s*([^☺️|^🙅]+)')

string = "blah xzuyguhbc ibcbb bqw 2 extract1  ☺️ jbjhcb 6 extract2 🙅 bjvcvvv"

m = my_regex.findall(string)
if m:
  print m

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM