
How to remove everything except words and emoji from text?

As part of a text classification problem, I am trying to clean a text dataset. So far I have been removing everything except words: punctuation, numbers, and emoji were all removed. Now I want to use emoji as features, so I need to retain both words and emoji.

First I am searching the emoji in the text and separating them from other words/emoji. This is because each emoji should be treated individually/separately. So I search an emoji and pad it with spaces at both its ends.

But I am at a loss as to how to combine the known regexes for words and emoji. Here is my current code:

import re

def clean_text(raw_text):

    padded_emoji_text = pad_emojis(raw_text)
    print("Emoji padded text: " + padded_emoji_text)

    reg = re.compile("[^a-zA-Z]") # line a

    # old regex to remove everything except words  
    letters_only_text = reg.sub(' ', raw_text)
    print("Cleaned text: " + letters_only_text)

    # Code to remove everything except text and emojis
    # How?

def pad_emojis(raw_text):

    print("Original Text: " + raw_text)

    reg = re.compile(u'['
      u'\U0001F300-\U0001F64F'
      u'\U0001F680-\U0001F6FF'
      u'\u2600-\u26FF\u2700-\u27BF]', 
      re.UNICODE)

    #padding the emoji with space at both ends
    new_text = reg.sub(r' \g<0> ',raw_text) 

    return new_text

text = "I am very #happy man! but😘😘 my wife😞 is not 😊😘. 99/33"
clean_text(text)

Current output:

Original Text: I am very #happy man! but😘😘 my wife😞 is not 😊😘. 99/33
Emoji padded text: I am very #happy man! but 😘  😘  my wife 😞  is not  😊  😘 . 99/33
Cleaned text: I am very  happy man  but   my wife  is not

What I am trying to achieve:

I am very happy man but 😘  😘  my wife 😞  is not  😊  😘

Questions:

1) How do I add the emoji regex to the compiled regex for words (line a)?

2) Can I achieve this in a better way, i.e. without writing a separate function just to find the emoji and pad them with spaces? I feel this can be avoided.

You may join the two steps into one using a single regex and a lambda expression inside a re.sub like this:

import re

emoji_pat = '[\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]'
shrink_whitespace_reg = re.compile(r'\s{2,}')

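def clean_text(raw_text):
    # Group 1 captures an emoji; the second alternative matches any other non-letter
    reg = re.compile(r'({})|[^a-zA-Z]'.format(emoji_pat)) # line a
    # Pad a captured emoji with spaces; replace anything else that matched with one space
    result = reg.sub(lambda x: ' {} '.format(x.group(1)) if x.group(1) else ' ', raw_text)
    # Collapse whitespace runs and drop the space left at the ends by removed characters
    return shrink_whitespace_reg.sub(' ', result).strip()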

text = 'I am very #happy man! but😘😘 my wife😞 is not 😊😘. 99/33'
print('Cleaned text: ' + clean_text(text))
# => Cleaned text: I am very happy man but 😘 😘 my wife 😞 is not 😊 😘


Explanation:

  • The first regex will look like ([\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF])|[^a-zA-Z]. It either matches an emoji and captures it into Group 1, or matches any other character that is not an ASCII letter. If an emoji was captured (the x.group(1) check inside the lambda), it is returned padded with a space on each side; otherwise the matched non-letter is replaced with a single space.
  • The \s{2,} pattern matches 2 or more whitespace characters, and shrink_whitespace_reg.sub(' ', result) replaces each such run with a single space.
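
Since the cleaned text is usually split into tokens afterwards anyway, another way to look at the same problem is to extract what you want instead of removing what you don't. The following is only a sketch under that assumption: the names EMOJI_PAT, TOKEN_REG, tokenize and clean_text_tokens are hypothetical, and the emoji ranges are the same ones used above, which cover only part of Unicode's emoji.

```python
import re

# Same emoji ranges as in the question; they do not cover every emoji in Unicode.
EMOJI_PAT = '[\U0001F300-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]'

# Match either a run of ASCII letters (a word) or exactly one emoji.
# The pattern has no capturing groups, so findall() returns the whole matches.
TOKEN_REG = re.compile(r'[a-zA-Z]+|{}'.format(EMOJI_PAT))

def tokenize(raw_text):
    """Hypothetical helper: return words and emoji as separate tokens."""
    return TOKEN_REG.findall(raw_text)

def clean_text_tokens(raw_text):
    """Hypothetical helper: join the tokens back into a single-spaced string."""
    return ' '.join(tokenize(raw_text))

text = 'I am very #happy man! but😘😘 my wife😞 is not 😊😘. 99/33'
print(tokenize(text))
print('Cleaned text: ' + clean_text_tokens(text))
```

Because the character class matches one emoji at a time, adjacent emoji come out as separate tokens without any explicit padding step, and joining with a single space means no whitespace cleanup is needed.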
