I am using Twint to extract the tweets returned by a particular search (about 100k tweets). The problem is that Twint outputs the tweet content with each emoji's title instead of its Unicode character. Here is one example:
@LulapeloBrasil presidente minha eterna gratidão a tudo que senhor fez, faz e fará ao nosso povo. Seguiremos lutando pelos nossos ideais! <Emoji: Heavy red heart> <Emoji: Flexed biceps (dark skin tone)> #LulaLivre #EusouLula #LulaValeALuta #OcupaSaoBernardo
This is a problem because I want to tokenize the tweets for further analysis (e.g. emoji usage), and a traditional tweet tokenizer (e.g. NLTK's TweetTokenizer) won't tokenize these placeholders properly.
Do you have any suggestions for how I can convert these emoji titles to their respective Unicode characters? (So far I am only able to extract the titles using re.)
Where can I get the data that Emojipedia uses? Or where can I download a list of all Twitter emojis with their Unicode code points and titles?
I found these files (with the help of @Philip Couling). They are a start toward solving the problem, although some additional processing will be needed.
Here is a Python package that may solve your problem:
emotlib - Python emoji + emoticon Library (<ゝω・)☆ 👨🚀👩🚀
It is easy to use, supports Python 2.7 and 3.6, and supports Emoji 11.0.
But I think you will still need to process the text first.
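As a rough illustration of that preprocessing, here is a minimal sketch that replaces Twint's `<Emoji: ...>` placeholders using a regular expression and a lookup table. It does not use emotlib. The names `EMOJI_BY_TITLE` and `titles_to_unicode` are hypothetical, and the table holds only the two emojis from the example tweet; a real table would have to be built from a complete list of emoji titles and code points.

```python
import re

# Hypothetical lookup table: lowercased Twint emoji title -> Unicode string.
# Only the two emojis from the example tweet are listed here.
EMOJI_BY_TITLE = {
    "heavy red heart": "\u2764\ufe0f",                         # U+2764 U+FE0F
    "flexed biceps (dark skin tone)": "\U0001F4AA\U0001F3FF",  # U+1F4AA U+1F3FF
}

# Matches placeholders such as "<Emoji: Heavy red heart>".
EMOJI_TAG = re.compile(r"<Emoji: ([^>]+)>")


def titles_to_unicode(text, table=EMOJI_BY_TITLE):
    """Replace each <Emoji: title> placeholder with its Unicode string.

    Titles missing from the table are left untouched so they can be
    inspected and added later.
    """
    def substitute(match):
        title = match.group(1).strip().lower()
        return table.get(title, match.group(0))

    return EMOJI_TAG.sub(substitute, text)


if __name__ == "__main__":
    tweet = ("Seguiremos lutando pelos nossos ideais! "
             "<Emoji: Heavy red heart> <Emoji: Flexed biceps (dark skin tone)> "
             "#LulaLivre")
    print(titles_to_unicode(tweet))
```

Leaving unknown titles in place, rather than deleting them, makes it easy to collect the titles that still need a mapping after a first pass over the 100k tweets.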