Convert emoji title to unicode

Question

I am using Twint to extract the tweets returned by a particular search (about 100k tweets). The problem is that Twint outputs the tweet content with each emoji's title instead of its Unicode character. Here is one example:

@LulapeloBrasil presidente minha eterna gratidão a tudo que senhor fez, faz e fará ao nosso povo. Seguiremos lutando pelos nossos ideais! <Emoji: Heavy red heart>  <Emoji: Flexed biceps (dark skin tone)> #LulaLivre #EusouLula #LulaValeALuta #OcupaSaoBernardo

This is a problem because I want to tokenize the tweets for further analysis (e.g. emoji usage), and a traditional tweet tokenizer (e.g. nltk's TweetTokenizer) won't tokenize them properly.

Do you have any suggestions on how I can convert these emoji titles to their respective Unicode characters? (I'm already able to extract the titles using re.)

Where can I get the data that Emojipedia uses? Or where can I download a list of all Twitter emojis with their Unicode code points and titles?

Answer 1

I found these files (with the help of @Philip Couling). It's a start toward solving the problem, although some additional processing will be needed.
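A minimal sketch of what that additional processing might look like, assuming the placeholders always have the form `<Emoji: title>` as in the example tweet. The `TITLE_TO_EMOJI` table and the `replace_emoji_titles` function are hypothetical names used only for illustration; in practice the table would be built from a complete emoji list rather than written by hand, and the titles in it would have to match the ones Twint emits.

```python
import re

# Hypothetical, hand-written lookup table (an assumption for this sketch).
# A real table would be generated from a full emoji list and keyed by the
# exact titles that appear in the Twint output.
TITLE_TO_EMOJI = {
    "heavy red heart": "\u2764\ufe0f",                         # U+2764 U+FE0F
    "flexed biceps (dark skin tone)": "\U0001F4AA\U0001F3FF",  # U+1F4AA U+1F3FF
}

# Matches placeholders such as "<Emoji: Heavy red heart>" and captures the title.
EMOJI_TAG = re.compile(r"<Emoji: ([^>]+)>")


def replace_emoji_titles(text, table=TITLE_TO_EMOJI):
    """Replace each known <Emoji: title> placeholder with its emoji.

    Titles are compared case-insensitively. Unknown titles are left
    untouched so they can be inspected and added to the table later.
    """
    def substitute(match):
        title = match.group(1).strip().lower()
        return table.get(title, match.group(0))

    return EMOJI_TAG.sub(substitute, text)


tweet = (
    "Seguiremos lutando pelos nossos ideais! <Emoji: Heavy red heart>  "
    "<Emoji: Flexed biceps (dark skin tone)> #LulaLivre"
)
print(replace_emoji_titles(tweet))
```

Leaving unknown placeholders in the text, rather than dropping them, makes it easy to collect the titles that are still missing from the table after a first pass over the 100k tweets.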

Answer 2

Here is a Python package that may solve your problem:

emotlib - Python emoji + emoticon Library (<ゝω・)☆ 👨‍🚀👩‍🚀

It is easy to use, supports Python 2.7 and 3.6, and supports Emoji 11.0.

But I think you will still need to process the text first.
