How do I replace Unicode characters in Python?

Question

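I am pulling Twitter data through their API, and one of the tweets contains a special character (the right apostrophe). I keep getting an error saying that Python can't map or character-map the character. I have looked all over the Internet but have yet to find a solution. I just want to replace that character with either an apostrophe that Python will recognize or an empty string (essentially removing it). I am using Python 3.3. Any input on how to fix this? It may seem simple, but I am new to Python.

Edit: Here is the function I am using to try to filter out the Unicode characters that raise the error.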

@staticmethod
def UnicodeFilter(var):
    temp = var
    temp = temp.replace(chr(2019), "'")
    temp = Functions.ToSQL(temp)
    return temp

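Also, when I run the program, my error is as follows.

'charmap' codec can't encode character '\u2019' in position 59: character maps to <undefined>

Edit: Here is a sample of my source code: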

import json
import mysql.connector
import unicodedata
from MySQLCL import MySQLCL

class Functions(object):
    """This is a class for Python functions"""

    @staticmethod
    def Clean(string):
        temp = str(string)
        temp = temp.replace("'", "").replace("(", "").replace(")", "").replace(",", "").strip()
        return temp

    @staticmethod
    def ParseTweet(string):
        for x in range(0, len(string)):
            tweetid = string[x]["id_str"]
            tweetcreated = string[x]["created_at"]
            tweettext = string[x]["text"]
            tweetsource = string[x]["source"]
            truncated = string[x]["truncated"]
            inreplytostatusid = string[x]["in_reply_to_status_id"]
            inreplytouserid = string[x]["in_reply_to_user_id"]
            inreplytoscreenname = string[x]["in_reply_to_screen_name"]
            geo = string[x]["geo"]
            coordinates = string[x]["coordinates"]
            place = string[x]["place"]
            contributors = string[x]["contributors"]
            isquotestatus = string[x]["is_quote_status"]
            retweetcount = string[x]["retweet_count"]
            favoritecount = string[x]["favorite_count"]
            favorited = string[x]["favorited"]
            retweeted = string[x]["retweeted"]
            possiblysensitive = string[x]["possibly_sensitive"]
            language = string[x]["lang"]

            print(Functions.UnicodeFilter(tweettext))
            #print("INSERT INTO tweet(ExTweetID, TweetText, Truncated, InReplyToStatusID, InReplyToUserID, InReplyToScreenName, IsQuoteStatus, RetweetCount, FavoriteCount, Favorited, Retweeted, Language, TweetDate, TweetSource, PossiblySensitive) VALUES (" + str(tweetid) + ", '" + Functions.UnicodeFilter(tweettext) + "', " + str(truncated) + ", " + Functions.CheckNull(inreplytostatusid) + ", " + Functions.CheckNull(inreplytouserid) + ", '" + Functions.CheckNull(inreplytoscreenname) + "', " + str(isquotestatus) + ", " + str(retweetcount) + ", " + str(favoritecount) + ", " + str(favorited) + ", " + str(retweeted) + ", '" + str(language) + "', '" + Functions.ToSQL(tweetcreated) + "', '" + Functions.ToSQL(tweetsource) + "', " + str(possiblysensitive) + ")")
            #MySQLCL.Set("INSERT INTO tweet(ExTweetID, TweetText, Truncated, InReplyToStatusID, InReplyToUserID, InReplyToScreenName, IsQuoteStatus, RetweetCount, FavoriteCount, Favorited, Retweeted, Language, TweetDate, TweetSource, PossiblySensitive) VALUES (" + str(tweetid) + ", '" + tweettext + "', " + str(truncated) + ", " + Functions.CheckNull(inreplytostatusid) + ", " + Functions.CheckNull(inreplytouserid) + ", '" + Functions.CheckNull(inreplytoscreenname) + "', " + str(isquotestatus) + ", " + str(retweetcount) + ", " + str(favoritecount) + ", " + str(favorited) + ", " + str(retweeted) + ", '" + language + "', '" + tweetcreated + "', '" + str(tweetsource) + "', " + str(possiblysensitive) + ")")

    @staticmethod
    def ToBool(variable):
        if variable.lower() == 'true':
            return True
        elif variable.lower() == 'false':
            return False

    @staticmethod
    def CheckNull(var):
        if var == None:
            return ""
        else:
            return var

    @staticmethod
    def ToSQL(var):
        temp = var
        temp = temp.replace("'", "''")
        return str(temp)

    @staticmethod
    def UnicodeFilter(var):
        temp = var
        #temp = temp.replace(chr(2019), "'")
        unicodestr = unicode(temp, 'utf-8')
        if unicodestr != temp:
            temp = "'"
        temp = Functions.ToSQL(temp)
        return temp

ekhumoro's answer is correct.

Answer 1

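There appear to be two problems with your program.

First, you are passing the wrong code point to chr(). The hexadecimal code point of the ’ character is 0x2019, but you are passing the decimal number 2019 (which is 0x7e3 in hexadecimal). So you need to do either: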

    temp = temp.replace(chr(0x2019), "'") # hexadecimal

or:

    temp = temp.replace(chr(8217), "'") # decimal

in order to replace the character correctly.
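As a quick check of the point above, here is a minimal sketch showing that chr(0x2019) and chr(8217) both produce the right single quotation mark, while chr(2019) is a different character. The helper name replace_right_quote and the sample text are made up for illustration.

```python
RIGHT_QUOTE = "\u2019"  # RIGHT SINGLE QUOTATION MARK

# chr() takes the code point as an integer; 2019 (decimal) is a different character.
assert chr(0x2019) == RIGHT_QUOTE
assert chr(8217) == RIGHT_QUOTE
assert chr(2019) != RIGHT_QUOTE


def replace_right_quote(text, replacement="'"):
    """Replace every U+2019 in text with replacement (hypothetical helper)."""
    return text.replace(RIGHT_QUOTE, replacement)


sample = "It\u2019s a tweet"  # made-up sample text
print(replace_right_quote(sample))      # It's a tweet
print(replace_right_quote(sample, ""))  # Its a tweet
```

Writing the character as the escape "\u2019" avoids the decimal/hexadecimal mix-up altogether.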

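Second, the error occurs because some other part of your program (probably the database backend) is trying to encode Unicode strings with an encoding other than UTF-8. It is hard to be more precise about this, because you did not include the full traceback in your question. However, the reference to "charmap" suggests that a Windows code page is being used (but not cp1252), or an ISO encoding (but not iso8859-1, a.k.a. latin1), or possibly KOI8_R.

In any case, the right way to fix the problem is to make sure that every part of your program (and especially the database) uses UTF-8. If you do that, you will no longer have to worry about replacing characters.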
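To illustrate which encodings can trigger this kind of error, the sketch below tries to encode a string containing U+2019 with a few codecs. The choice of cp437 as the failing code page is only an assumption for the demonstration; the question does not say which encoding was actually in use.

```python
text = "It\u2019s"  # made-up sample containing U+2019


def try_encode(s, encoding):
    """Return the encoded bytes, or the error message if encoding fails."""
    try:
        return s.encode(encoding)
    except UnicodeEncodeError as exc:
        return str(exc)


# utf-8 and cp1252 can represent U+2019; cp437 and koi8_r cannot,
# and both report the failure through the 'charmap' codec.
for enc in ("utf-8", "cp1252", "cp437", "koi8_r"):
    print(enc, "->", try_encode(text, enc))
```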
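As a rough sketch of what "use UTF-8 everywhere" means on the output side, the example below writes the same text through two text layers: one configured for UTF-8 and one for a legacy code page. The in-memory buffers stand in for a console, file, or pipe, and cp437 is again just an assumed example. The database connection and tables would need a UTF-8 character set as well, which is not shown here.

```python
import io

text = "It\u2019s"  # made-up sample containing U+2019

# A byte buffer wrapped in a UTF-8 text layer accepts the character.
raw = io.BytesIO()
out = io.TextIOWrapper(raw, encoding="utf-8")
out.write(text)
out.flush()
print(raw.getvalue())

# The same write fails when the text layer uses a code page without U+2019.
legacy_error = None
legacy = io.TextIOWrapper(io.BytesIO(), encoding="cp437")
try:
    legacy.write(text)
    legacy.flush()
except UnicodeEncodeError as exc:
    legacy_error = exc
    print("failed:", exc)
```

For console output specifically, setting the PYTHONIOENCODING environment variable to utf-8 before starting Python is another option worth trying.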

Answer 2

You can encode your unicode string to convert it to the str type:

a = u"dataàçççñññ"
type(a)                      # str in Python 3 (unicode in Python 2)
a.encode('ascii', 'ignore')  # b'data' in Python 3 ('data' in Python 2)

This removes the special characters and returns "data" (in Python 3, as the bytes object b'data').

Alternatively, you can use unicodedata.
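The answer does not show the unicodedata approach, so the following is only one plausible reading of it: decompose accented characters with unicodedata.normalize and then drop whatever is still not ASCII. The function name to_ascii is made up for this sketch.

```python
import unicodedata


def to_ascii(text):
    """Decompose accented characters, then drop anything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("dataàçççñññ"))  # dataacccnnn
print(to_ascii("It\u2019s"))    # Its
```

Note that U+2019 has no decomposition, so this approach deletes it rather than turning it into a plain apostrophe.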

Answer 3

# Python 2 only: the unicode() built-in does not exist in Python 3.
unicode_string = unicode(some_string, 'utf-8')
if unicode_string != some_string:
    some_string = 'whatever you want it to be'
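Since the question targets Python 3.3, where unicode() is not available, here is a hedged Python 3 sketch of what this snippet seems to be aiming for: detect a string that contains non-ASCII characters and replace the whole string with a fallback. This is an interpretation rather than the answerer's code, and the name replace_if_not_ascii is hypothetical.

```python
def replace_if_not_ascii(some_string, fallback="whatever you want it to be"):
    """Return fallback if some_string contains any non-ASCII character."""
    try:
        some_string.encode("ascii")
    except UnicodeEncodeError:
        return fallback
    return some_string


print(replace_if_not_ascii("plain text"))  # plain text
print(replace_if_not_ascii("It\u2019s"))   # whatever you want it to be
```

Like the original, this discards the entire string, not just the offending character.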


Answer 1: score 2, accepted, 2016-03-19 19:44:05
Answer 2: score 1, 2016-03-18 05:07:46
Answer 3: score -1, 2016-03-18 03:24:56
