
Convert emoji title to unicode

I am using Twint to extract the tweets returned by a particular search (about 100k tweets). The problem is that Twint outputs the tweet content with each emoji's title instead of its Unicode character. Here is one example:

@LulapeloBrasil presidente minha eterna gratidão a tudo que senhor fez, faz e fará ao nosso povo. Seguiremos lutando pelos nossos ideais! <Emoji: Heavy red heart>  <Emoji: Flexed biceps (dark skin tone)> #LulaLivre #EusouLula #LulaValeALuta #OcupaSaoBernardo

This is a problem because I want to tokenize the tweets for further analysis (e.g. emoji usage), and a standard tweet tokenizer (e.g. NLTK's TweetTokenizer) won't tokenize these placeholders properly.

Do you have any suggestions for converting these emoji titles to their respective Unicode characters? (I can already extract the titles using only re.)

Where can I get the data that Emojipedia uses? Or where can I download a list of all Twitter emojis with their Unicode code points and titles?

I found these files (with the help of @Philip Couling). They are a start toward solving the problem, although some additional processing will be needed.
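As a rough illustration of that processing, here is a minimal sketch that swaps the `<Emoji: ...>` placeholders for the actual characters using only re. The table `TITLE_TO_EMOJI` and the function `replace_emoji_titles` are hypothetical names, and the two entries are filled in by hand for the example tweet; a real table would have to be built from an emoji data file whose titles match what Twint prints.

```python
import re

# Hypothetical lookup table: lower-cased title as printed by Twint -> emoji.
# Only the two titles from the example tweet are filled in here.
TITLE_TO_EMOJI = {
    "heavy red heart": "\u2764\ufe0f",                         # U+2764 + variation selector-16
    "flexed biceps (dark skin tone)": "\U0001F4AA\U0001F3FF",  # U+1F4AA + skin tone modifier
}

# Matches placeholders such as "<Emoji: Heavy red heart>" and captures the title.
EMOJI_TAG = re.compile(r"<Emoji: ([^>]+)>")


def replace_emoji_titles(text, table=TITLE_TO_EMOJI):
    """Replace each <Emoji: title> placeholder with its character.

    Placeholders whose title is missing from the table are left unchanged,
    so they can be found and added to the table later.
    """
    def _substitute(match):
        title = match.group(1).strip().lower()
        return table.get(title, match.group(0))

    return EMOJI_TAG.sub(_substitute, text)


tweet = ("Seguiremos lutando pelos nossos ideais! <Emoji: Heavy red heart>  "
         "<Emoji: Flexed biceps (dark skin tone)> #LulaLivre")
print(replace_emoji_titles(tweet))
```

Leaving unknown titles untouched, rather than dropping them, makes it easy to collect the titles that still need a mapping across the whole set of tweets.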

Here is a Python package that may solve your problem:

emotlib - Python emoji + emoticon Library (<ゝω・)☆ 👨‍🚀👩‍🚀

It is easy to use, supports Python 2.7 and 3.6, and supports Emoji 11.0.

I think you will still need to process the text first, though.
