How do I programmatically create a Unicode string in Python?

I'm writing a spam filter that looks at lists of banned words. I'm trying to create Unicode strings which I can convert to accent-free strings using unidecode.

To make a Unicode string in the REPL, I could type

s = u'ShowTîtžForBłackDIçk'

but how do I do this when I don't know the string in advance? I need to apply the "u" prefix programmatically.

I've tried s = unicode(unicodeString)

but this function needs me to state an encoding, and I'm not sure what underlying encoding is being used. I'm using the IPython (Jupyter) notebook, which can render Unicode in its web interface.

Open your file using Python's text reader. You have to specify the encoding (it won't guess!):

import io

# io.open decodes each line with the given encoding as it is read
with io.open("myspamwords.txt", "r", encoding="utf-8") as mywords:
    for line in mywords:
        print(line.strip())
        print(type(line))

This code prints each line and should report its type as unicode (on Python 3 the same text type is called str).
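If the words arrive as byte strings from somewhere other than a file, the same idea applies: decode them with an explicit encoding instead of looking for a way to apply the u prefix. A minimal sketch, assuming the bytes are UTF-8; the helper name to_text is hypothetical and not part of any library:

```python
def to_text(value, encoding="utf-8"):
    """Return value as a Unicode string, decoding it if it is a byte string."""
    if isinstance(value, bytes):
        # Assumption: the caller knows (or has been told) the encoding.
        return value.decode(encoding)
    # Already text: pass it through unchanged.
    return value


raw = b"B\xc5\x82ack"   # the UTF-8 bytes for u"B\u0142ack"
text = to_text(raw)
print(repr(text))
```

The result is ordinary Unicode text, so it can be handed to unidecode or compared against the banned-word list directly.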

If the text comes out garbled or raises a decoding error, change the encoding argument to the file's actual character encoding.
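When the encoding is not known up front, one option is to try a short list of candidates in order. This is only a sketch: the function name decode_with_fallback and the candidate list are assumptions, and a decode that succeeds is not proof that the guess was right, so check the output by eye.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252")):
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding; try the next one.
            continue
    raise ValueError("none of the candidate encodings could decode the data")


text, used = decode_with_fallback(b"caf\xe9")  # not valid UTF-8
print(repr(text), used)
```

UTF-8 is tried first because it rejects most byte sequences that were written in another encoding, whereas single-byte encodings such as cp1252 accept almost anything.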
