
How can I handle encoding properly when passing data from a news feed to an IRC server?

Code:

import socket, feedparser

feed = feedparser.parse("http://pwnmyi.com/feed")
latest = feed.entries[0]
art_name = latest.title

network = 'irc.rizon.net'
port = 6667
irc = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
irc.connect((network, port))
print irc.recv(4096)
irc.send('NICK PwnBot\r\n')
irc.send('USER PwnBot PwnBot PwnBot :PwnBot by Fike\r\n')
irc.send('JOIN #pwnmyi\r\n')
while True:
    data = irc.recv(4096)
    if data.find('PING') != -1:
        irc.send('PONG ' + data.split() [1] + '\r\n')
    if data.find( '!latest' ) != -1:
        irc.send('PRIVMSG #pwnmyi :Latest Article: ' + art_name + '\r\n')

It connects fine, but when I type !latest in the channel, it quits with this:

    irc.send('PRIVMSG #pwnmyi :Latest Article: ' + art_name + '\r\n')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 55: ordinal not in range(128)

Could you please help me debug this code? It used to work for me before.

The IRC protocol does not define a particular character encoding for messages. Rather, it is an 8-bit protocol in which certain octets are used as control characters. (See RFC 1459, section 2.2.)

Apparently the popular mIRC client decodes UTF-8 sequences when it recognizes them as such. UTF-8 is a sensible choice for IRC: ASCII characters are encoded as the same single bytes they have in ASCII, and every non-ASCII code point is encoded using only bytes with values greater than 127, so it cannot be mistaken for one of the protocol's control characters.

In Python, that's spelled unicode.encode(encoding='utf8'), like so:

>>> u'\u0ca0_\u0ca0'.encode('utf8')
'\xe0\xb2\xa0_\xe0\xb2\xa0'
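
Applied to the bot in the question, that means encoding the outgoing line before it reaches the socket. Here is a minimal sketch, assuming the title is already a Unicode string (which is what feedparser normally returns). build_privmsg is a hypothetical helper name, not part of the original code, and the title is made up:

```python
def build_privmsg(channel, text):
    """Return one IRC PRIVMSG line as UTF-8 bytes, ready for socket.send().

    build_privmsg is a hypothetical helper, not part of the question's code.
    """
    line = u'PRIVMSG {0} :{1}\r\n'.format(channel, text)
    return line.encode('utf-8')


# U+2013 (en dash) is the character named in the question's traceback.
message = build_privmsg(u'#pwnmyi', u'Latest Article: Foo \u2013 Bar')
```

In the bot, the failing call would then become irc.send(build_privmsg(u'#pwnmyi', u'Latest Article: ' + art_name)).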

Personally, I'd recommend converting all strings to UTF-8. You can decode and encode strings with helpers like these:

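def decode(data):
    # Try UTF-8 first: it is strict, so a wrong guess fails loudly.
    try:
        text = data.decode('utf-8')
    except UnicodeDecodeError:
        try:
            # cp1252 leaves a few byte values undefined, so it can still fail.
            text = data.decode('cp1252')
        except UnicodeDecodeError:
            # iso-8859-1 maps every byte, so this last resort never raises.
            text = data.decode('iso-8859-1')
    return text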


def encode(text):
    # UTF-8 can represent any Unicode text, so the fallbacks below
    # are rarely, if ever, reached.
    try:
        data = text.encode('utf-8')
    except UnicodeEncodeError:
        try:
            data = text.encode('iso-8859-1')
        except UnicodeEncodeError:
            data = text.encode('cp1252')
    return data

This is an excellent website that explains Python's Unicode: http://farmdev.com/talks/unicode

The best 3 tips from it are:

  1. Decode early
  2. Unicode everywhere
  3. Encode late
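
The three tips above can be sketched for the bot's main loop as follows. This is only an illustration: handle_line and its parameters are hypothetical names, errors='replace' is an assumption chosen so that odd bytes cannot crash the bot, and the function assumes it is handed exactly one complete line, which a real socket read does not guarantee.

```python
def handle_line(raw, latest_title):
    """Return the replies (as bytes) for one raw line from the server.

    handle_line and its parameters are hypothetical, for illustration only.
    """
    # Decode early: bytes from the socket become text at the boundary.
    line = raw.decode('utf-8', 'replace').rstrip(u'\r\n')

    # Unicode everywhere: all of the logic works on text only.
    replies = []
    if line.startswith(u'PING'):
        replies.append(u'PONG ' + line.split()[1])
    elif u'!latest' in line:
        replies.append(u'PRIVMSG #pwnmyi :Latest Article: ' + latest_title)

    # Encode late: text becomes bytes again just before it is sent.
    return [(reply + u'\r\n').encode('utf-8') for reply in replies]
```

Each element of the returned list can be passed straight to irc.send().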

You'll have to encode the string you post to the IRC server. Also, depending on what feedparser returns, you might want to decode it from a specific encoding.

The encoding depends on what the feed contains.

latest.title has non-ASCII characters in it.

You must either remove them, escape them or translate them.

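The cheap and easy way out is to use repr(), which escapes the non-ASCII characters (the title shows up in the channel as a u'...' literal with \u2013 in place of the dash):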

    irc.send('PRIVMSG #pwnmyi :Latest Article: ' + repr(art_name) + '\r\n')

Or, better:

    irc.send('PRIVMSG #pwnmyi :Latest Article: {0!r}\r\n'.format( art_name ) )

In the long run, you need to address non-ASCII characters in your input.
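
The three options mentioned above (remove, escape, translate) can be sketched with the standard codec error handlers. The title and the one-entry mapping table are made up for illustration. A real bot would need a larger table, or could simply send UTF-8 instead.

```python
title = u'Foo \u2013 Bar'  # made-up title containing an en dash (U+2013)

# Remove: drop every character that ASCII cannot represent.
removed = title.encode('ascii', 'ignore')

# Escape: replace such characters with a backslash escape sequence.
escaped = title.encode('ascii', 'backslashreplace')

# Translate: map selected characters to ASCII stand-ins first.
# ASCII_STAND_INS is a hypothetical, deliberately tiny table.
ASCII_STAND_INS = {0x2013: u'-'}
translated = title.translate(ASCII_STAND_INS).encode('ascii')
```

All three results are plain ASCII byte strings, so they can be concatenated with the rest of the PRIVMSG line and sent without raising UnicodeEncodeError.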


 