
Unicode emojis in Python from CSV files

I have some CSV data containing users' tweets.

In Excel it is displayed like this:

‰ÛÏIt felt like they were my friends and I was living the story with them‰Û  #retired #IAN1 

I imported this CSV file into Python, where the same tweet appears like this (I am using PuTTY to connect to a server, and I copied this from PuTTY's screen):

▒▒▒It felt like they were my friends and I was living the story with them▒۝ #retired #IAN1 

I am wondering how to display these emoji characters properly. I am trying to separate all the words in this tweet, but I am not sure how to separate those emoji Unicode characters.

In fact, you have almost certainly lost data…

I don't know how you got your CSV file from the users' tweets (you may want to explain that). CSV files produced on Windows are often encoded in "cp1252" (or "windows-1252"), sometimes in "iso-8859-1". Nowadays, we can also find CSV files encoded in "utf-8".

If your tweets are encoded in "cp1252" or any other 8-bit single-byte character set, the emojis are lost (replaced by "?") or badly converted.
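As a minimal sketch of that loss (assuming Python 3, and a made-up sample string rather than your real data), encoding an emoji to "cp1252" either fails or silently turns it into "?", while "utf-8" round-trips it:

```python
# Hypothetical sample text; U+1F600 is the Grinning Face emoji.
text = "I love this \U0001F600"

# cp1252 has no slot for U+1F600, so a strict encode raises an error.
try:
    text.encode("cp1252")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# With errors="replace", the emoji silently becomes "?".
lossy = text.encode("cp1252", errors="replace")

# UTF-8 can represent every code point, so nothing is lost.
round_trip = text.encode("utf-8").decode("utf-8")

print(strict_ok)
print(lossy)
print(round_trip == text)
```

Once the "?" has been written to the file, the original character cannot be recovered from the file alone.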

Then, if you open your CSV file in Excel, it will use its default encoding ("cp1252") and load the file with corrupted characters. You can try LibreOffice instead: it has a dialog box that lets you choose the encoding more easily.
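In Python you can choose the encoding yourself when opening the file. The following is only a sketch, assuming Python 3 and a UTF-8 file; `read_rows` and the sample rows are hypothetical names and data used for illustration:

```python
import csv
import os
import tempfile


def read_rows(path, encoding="utf-8"):
    """Read all rows of a CSV file, decoding it with an explicit encoding."""
    # newline="" is what the csv module documentation recommends for files.
    with open(path, newline="", encoding=encoding) as handle:
        return list(csv.reader(handle))


# Demo: write a tiny UTF-8 CSV file, then read it back.
sample = [["user", "tweet"], ["alice", "living the story \U0001F600 #retired"]]
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline="", encoding="utf-8"
) as tmp:
    csv.writer(tmp).writerows(sample)
    tmp_path = tmp.name

rows = read_rows(tmp_path)
os.remove(tmp_path)
print(rows == sample)
```

If the file is really "cp1252", pass `encoding="cp1252"` instead; reading with the wrong encoding is what produces the garbled characters.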

Copying and pasting from PuTTY will also convert your characters depending on your console encoding… That makes it even worse!

If your CSV file uses "utf-8" encoding (or "utf-16", "utf-32"), you have a better chance of preserving the emojis. But there is still a problem: most emojis have a code point greater than U+FFFF (65535 in decimal). For instance, Grinning Face "😀" has the code point U+1F600.
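A quick way to check this for any character (a sketch assuming Python 3, where `ord()` returns the full code point):

```python
emoji = "\U0001F600"  # Grinning Face
code_point = ord(emoji)

# Format the code point in the usual U+XXXX notation.
label = "U+{:04X}".format(code_point)
above_bmp = code_point > 0xFFFF

# Number of bytes needed in each Unicode encoding (no byte order mark).
sizes = {name: len(emoji.encode(name)) for name in ("utf-8", "utf-16-le", "utf-32-le")}

print(label)
print(above_bmp)
print(sizes)
```

For comparison, the letter "A" takes 1, 2 and 4 bytes respectively in the same three encodings.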

Characters of this kind are handled badly in Python 2. Try this:

# coding: utf8
from __future__ import unicode_literals

emoji = u"😀"

print(u"emoji: " + emoji)
print(u"repr: " + repr(emoji))
print(u"len: {}".format(len(emoji)))

You'll get (if your console allows it):

emoji: 😀
repr: u'\U0001f600'
len: 2
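  • The first line won't print if your console doesn't support Unicode.
  • The \U escape sequence is similar to \u, but expects 8 hex digits, not 4.
  • Yes, this character has a length of 2! On a "narrow" Python 2 build, a code point above U+FFFF is stored as a UTF-16 surrogate pair; a "wide" build reports 1.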
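The length of 2 comes from UTF-16, which represents a code point above U+FFFF as two 16-bit units (a surrogate pair). A small sketch of that arithmetic, written for Python 3; `to_surrogate_pair` is a hypothetical helper, not a standard library function:

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)   # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits of the offset
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))

# Cross-check against Python's own UTF-16 encoder.
encoded = "\U0001F600".encode("utf-16-be")
units = [int.from_bytes(encoded[i:i + 2], "big") for i in range(0, len(encoded), 2)]
print(units == [high, low])
```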

EDIT: With Python 3, you get:

emoji: 😀
repr: '😀'
len: 1
  • repr() shows the character itself, with no escape sequence.
  • The length is 1!

What you can do is post a fragment of your CSV file as an attachment, so that someone can analyse it…

See also Unicode Literals in Python Source Code in the Python 2.7 documentation.

First of all, you shouldn't work with text copied from a console (let alone from a remote connection), because of formatting differences and because clipboards are unreliable. I'd suggest exporting your CSV and reading it directly.

I'm not quite sure what you are trying to do, but whether emojis display in a console depends on the terminal's encoding and font; in the tweet text they are ordinary Unicode characters, not images. Would you mind explaining your issue further?

I would personally treat the whole string as Unicode, separate the characters into a list, then rebuild the words based on spaces.
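A rough sketch of that idea, assuming Python 3 and assuming that "emoji" can be approximated by the Unicode category "So" (Symbol, other); `is_symbol` and `split_tweet` are hypothetical helper names, and the sample tweet is made up:

```python
import unicodedata


def is_symbol(ch):
    """Rough emoji test (an assumption): Unicode category 'So'."""
    return unicodedata.category(ch) == "So"


def split_tweet(text):
    """Split on whitespace and emit each symbol character as its own token."""
    tokens = []
    for chunk in text.split():
        word = []
        for ch in chunk:
            if is_symbol(ch):
                # Flush the word collected so far, then emit the symbol alone.
                if word:
                    tokens.append("".join(word))
                    word = []
                tokens.append(ch)
            else:
                word.append(ch)
        if word:
            tokens.append("".join(word))
    return tokens


tokens = split_tweet("living the story\U0001F600 with them \U0001F600\U0001F600 #retired #IAN1")
# ascii() keeps the output printable even on a console without Unicode support.
print([ascii(token) for token in tokens])
```

This approximation does not handle emojis made of several code points (skin-tone modifiers, joined sequences), so a dedicated emoji library may be a better fit for real data.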

