Python Character Encoding?

I'm not sure about the title I used, but basically I want to get rid of the weird characters in a string. Here is the code:

#!/usr/bin/python
# -*- coding: utf-8 -*-
# source: http://snippets.dzone.com/posts/show/4569

from htmlentitydefs import name2codepoint as n2cp
import re

def substitute_entity(match):
    ent = match.group(3)

    if match.group(1) == "#":
        if match.group(2) == '':
            return unichr(int(ent))
        elif match.group(2) == 'x':
            return unichr(int('0x'+ent, 16))
    else:
        cp = n2cp.get(ent)

        if cp:
            return unichr(cp)
        else:
            return match.group()

def decode_htmlentities(string):
    entity_re = re.compile(r'&(#?)(x?)(\d{1,5}|\w{1,8});')
    return entity_re.subn(substitute_entity, string)[0]



test = ['<b>Blogger</b> in the Classroom - <b>Google</b>', 'Of\xef\xac\x81cial <b>Google   Blog</b>']
container = []

for i in test:
    container.append(decode_htmlentities(i))
print container

for i in test:
    print decode_htmlentities(i)

And here is the result:

['<b>Blogger</b> in the Classroom - <b>Google</b>', 'Of\xef\xac\x81cial <b>Google Blog</b>']

<b>Blogger</b> in the Classroom - <b>Google</b>
Ofﬁcial <b>Google Blog</b>

The Question:

Using the same function (decode_htmlentities()), why do I get a different result when appending to a list than when 'just' printing?

Here is the difference:

Of\xef\xac\x81cial <b>Google Blog</b> # output from list
Ofﬁcial <b>Google Blog</b> # output from print

If you add a UTF-8 encoded string to a list, printing the list shows the string with \x escapes, because a list is printed using the repr of each element. You get the same result by calling repr on the string directly. Everything is working correctly.

If you want to print a list without escaping its contents, you have to loop through the list yourself. If you see the correct value when the string is printed outside the list, then adding it to the list changes nothing about the string itself; the only difference is how the list presents it when converted with str.
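As a small illustration of the point above, here is a sketch written for Python 3 (an assumption on my part, since the question uses Python 2). In Python 3 the closest analogue of the question's byte string is a bytes object; the variable names are just for this example.

```python
# Python 3 sketch: a bytes object stands in for the Python 2 str in the question.
raw = b'Of\xef\xac\x81cial <b>Google Blog</b>'
container = [raw]

# str() of a list is assembled from repr() of each element,
# which is why the \x escapes show up when the whole list is printed.
list_text = str(container)
element_repr = repr(raw)

print(list_text)
print(element_repr)

# Decoding and joining the elements yourself bypasses repr() entirely.
joined = ', '.join(item.decode('utf-8') for item in container)
```

The list output is exactly the element's repr wrapped in brackets, so the escaping comes from how the list is displayed, not from the data.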

Perhaps you want to have proper Unicode characters instead of the UTF-8 bytes. So instead of this:

>>> s = '\xef\xac\x81'
>>> print [s]
['\xef\xac\x81']

You can see this:

>>> u = s.decode('utf-8')
>>> print [u]
[u'\ufb01']
>>> len(u)
1
>>> print u
ﬁ

Now you can manipulate the character as an atomic unit, which I hope is what you actually want; additionally, any tool which needs properly encoded characters will know what to do, since you've presented it with characters instead of bytes.
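To make the bytes-versus-characters difference concrete, here is a minimal sketch in Python 3 (assumed here; the session above is Python 2). It only uses the standard unicodedata module, and the variable names are illustrative.

```python
import unicodedata

raw = b'Of\xef\xac\x81cial'      # UTF-8 bytes, as in the question
text = raw.decode('utf-8')       # text made of Unicode code points

byte_count = len(raw)            # the ligature takes three bytes
char_count = len(text)           # but it is a single character
ligature = text[2]
ligature_name = unicodedata.name(ligature)

print(byte_count, char_count, ligature_name)
```

Once decoded, indexing and len() work per character, which is usually what you want when cleaning up text.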

"\xef\xac\x81" is the UTF-8 encoding of "ﬁ" (the single ligature character U+FB01), NOT the two letters "fi". The ligature is a typographic trick to set the letters closer together. So if you simply want to drop such characters:

unicode(someoddstring, errors='ignore').encode('ascii')

or, alternatively, catch the conversion error with try/except and do a string replace of that byte sequence with the plain letters "fi".
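Here is a rough comparison of those options in Python 3 (an assumption; the one-liner above is Python 2). The third variant, Unicode compatibility normalization (NFKC), is not mentioned in this answer; I'm adding it as one possible alternative that folds the ligature into plain letters without hard-coding the character.

```python
import unicodedata

raw = b'Of\xef\xac\x81cial'

# Option 1: ignore anything that is not ASCII. The ligature is dropped.
dropped = raw.decode('ascii', errors='ignore')

# Option 2: decode properly, then replace the ligature explicitly.
replaced = raw.decode('utf-8').replace('\ufb01', 'fi')

# Option 3 (not from the answer): NFKC normalization maps
# compatibility characters such as U+FB01 to their plain equivalents.
folded = unicodedata.normalize('NFKC', raw.decode('utf-8'))

print(dropped, replaced, folded)
```

Note that ignoring errors silently loses the letters, so the replace or normalize variants are probably closer to what you want if the text should stay readable.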

EDIT: The character encoding is working as expected, and the bytes should be stored that way in the string: "ﬁ" is not an ASCII character, so it will always be shown in escaped form in the string's repr.

If you simply want to make the list print out in the [ string, string ] format:

print "[",
for i in oddList:
  print i + ",",
print "]"
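The same idea can be sketched with str.join, here in Python 3 (an assumption, as the loop above is Python 2). The helper name format_list is made up for this example; it also avoids the trailing comma that the loop prints.

```python
def format_list(items):
    """Return the items in [a, b] form without repr() escaping."""
    return '[' + ', '.join(items) + ']'

# Hypothetical sample data: already-decoded text strings.
odd_list = ['<b>Blogger</b>', 'Of\ufb01cial <b>Google Blog</b>']
formatted = format_list(odd_list)
```

Printing formatted then shows the characters themselves rather than their escapes, provided the terminal can display them.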
