Messing up my unicode output, but where and how?

Question

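I am doing word counts on some text files and storing the results in a dictionary. My problem is that, after output to a file, the words are not displayed correctly even though they are fine in the original text. (I use TextWrangler to look at them.) For example, dashes show up as dashes in the original file but as \u2014 in the output; in the output, every word is also prefixed with a u.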

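Problem

I cannot figure out where, when, and how in my script this happens.

I read the files with codecs.open() and output them with codecs.open() and with json.dump(). Both go wrong in the same way. In between, all I do is

tokenizing
regular expressions
collecting in a dictionary

and I do not see where I mess things up. I have deactivated tokenizing and most of the other functions. All of this happens in Python 2. Following earlier advice, I tried to keep everything in the script in Unicode.

Here is what I do (irrelevant code omitted):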

import codecs
import collections
import json
import os
import re
import string

import nltk

# dir, fileno and file_name come from the omitted part of the script
dicti = collections.defaultdict(list)

# read in file, iterating over a list of "fileno"s
with codecs.open(os.path.join(dir, unicode(fileno) + ".txt"), "r", "utf-8") as inputfili:
    inputtext = inputfili.read()

# process the text: tokenize, lowercase, remove punctuation and conjugation
# stand-in: the real script uses a regular expression (not shown) to extract text w/out metadata
content = inputtext
contentsplit = nltk.tokenize.word_tokenize(content)
text = [i.lower() for i in contentsplit if not re.match(r"\d+", i)]
text = [re.sub(r"('s|s|s's|ed)\b", "", i) for i in text if i not in string.punctuation]

# build the dictionary of word counts
for word in text:
    dicti[word].append(word)

# collect counts for each word, make dictionary of unique words
dicti_nos = {unicode(k): len(v) for k, v in dicti.items()}
hapaxdicti = {k: v for k, v in dicti_nos.items() if v == 1}

# sort the dictionary
# assumption: the question does not show which dict "dictionary" refers to
dictionary = dicti_nos
sorteddict = sorted(dictionary.items(), key=lambda x: x[1], reverse=True)

# output the results as .txt and json-file
with codecs.open(file_name, "w", "utf-8") as outputi:
    outputi.write("\n".join([unicode(i) for i in sorteddict]))
with open(file_name + ".json", "w") as jsonoutputi:
    json.dump(dictionary, jsonoutputi, encoding="utf-8")

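Edit: Solution

It looks like my main problem was writing to the file the wrong way. If I change my code to the version below, everything works. It seems that joining a list of (string, number) tuples messed up the string part; if I join each tuple first, everything works.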
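To illustrate why converting the whole tuple goes wrong, here is a minimal sketch with made-up data; sample_counts and the other names are illustrative only and not part of the original script. Converting a tuple to text gives the tuple's repr, which wraps the word in parentheses and quotes (and, on Python 2, adds the u prefix and the \u2014 escape), whereas formatting the word and the count separately leaves the word untouched.

```python
# Hypothetical sample data standing in for sorteddict.
sample_counts = [(u"word\u2014dash", 3), (u"caf\u00e9", 1)]

# Converting each tuple as a whole yields the tuple's repr, not the word.
# On Python 2 the repr also shows the u prefix and backslash escapes.
as_repr = [u"%s" % (pair,) for pair in sample_counts]

# Formatting the two fields separately keeps the text as it is.
as_lines = [u"%s:%d" % (word, count) for word, count in sample_counts]

# Each repr starts with the opening parenthesis of the tuple.
print(as_repr[0][0])  # prints (
```

The list as_lines can then be joined with "\n" and written through a file object that encodes to UTF-8.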

For the json output, I had to switch to codecs.open() and set ensure_ascii to False. Apparently, just setting encoding to utf-8 does not do what I thought it would.

with codecs.open(file_name, "w", "utf-8") as outputi:
    outputi.write("\n".join([":".join([i[0],unicode(i[1])]) for i in sorteddict]))

with codecs.open(file_name+".json", "w", "utf-8") as jsonoutputi:
    json.dump(dictionary, jsonoutputi,  ensure_ascii=False)
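As a rough sketch of what ensure_ascii changes, the following compares the two settings in memory, using a made-up one-entry dictionary (counts is a hypothetical name). By default the json module escapes every non-ASCII character as \uXXXX, which is still valid JSON and decodes back to the same data. As far as I know, the encoding parameter in Python 2 only tells json how to decode byte strings passed in; it does not control the output, which would explain why setting it alone made no difference.

```python
import json

# Hypothetical word-count dictionary with a single em dash as its key.
counts = {u"\u2014": 2}

escaped = json.dumps(counts)                       # default: ensure_ascii=True
readable = json.dumps(counts, ensure_ascii=False)  # keeps the character itself

# Both strings are valid JSON and decode to the same dictionary.
same = json.loads(escaped) == json.loads(readable) == counts

# The default output is pure ASCII, so it is safe to print anywhere.
print(escaped)  # prints {"\u2014": 2}
```

So the escaped file was not corrupt; it was just harder to read in a text editor.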

Thanks for your help!

Answer 1

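Since your example is partly pseudocode, I cannot run a real test or give you something that runs and has been tested, but from reading what you have provided, I think you may be misunderstanding how Unicode works in Python 2.

The unicode type (what you get from the unicode() or unichr() functions, for example) is an internal representation of a Unicode string that can be used for string manipulation and comparison. It has no associated encoding. The unicode() function takes a buffer as its first argument and an encoding as its second, and interprets the buffer using that encoding to produce an internally usable Unicode string that, from that point on, is not tied to any encoding.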
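A small sketch of the "no associated encoding" point, using the decode() method of byte strings, which does the same job as unicode(buffer, encoding) and also exists in Python 3. The byte values are examples chosen for illustration: the same em dash stored under three encodings decodes to one and the same Unicode string.

```python
# The same character stored under three different encodings.
utf8_bytes = b"\xe2\x80\x94"
cp1252_bytes = b"\x97"
utf16_bytes = b"\x14\x20"  # UTF-16, little-endian, no byte order mark

# Decoding interprets each byte string using the named encoding.
from_utf8 = utf8_bytes.decode("utf-8")
from_cp1252 = cp1252_bytes.decode("cp1252")
from_utf16 = utf16_bytes.decode("utf-16-le")

# All three results are the same one-character Unicode string.
print(from_utf8 == from_cp1252 == from_utf16 == u"\u2014")  # prints True
```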

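That Unicode string is not meant to be written to a file as it is. All file formats assume some encoding, and you have to supply an encoding again before writing a Unicode string out to a file. Every place where you have a construct like unicode(fileno), unicode(k), or unicode(i) is suspect, both because you are relying on the default encoding (which is probably not what you want) and because you go on to expose most of these values directly to the file system.

Once you are done working with these Unicode strings, you can call their built-in encode() method, with the encoding you want as the argument, to pack them into ordinary byte strings in that encoding.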
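A minimal sketch of that encode() step, with a made-up sample string (text and the other names are illustrative). It also shows what happens with an encoding that cannot represent the characters, which is roughly the situation when Python 2 falls back to its ASCII default.

```python
# Hypothetical sample text containing three non-ASCII characters.
text = u"caf\u00e9 \u2014 na\u00efve"

# Encoding packs the text into bytes suitable for a UTF-8 file.
utf8_bytes = text.encode("utf-8")

# ASCII cannot represent these characters, so encoding to it fails.
ascii_error = None
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    ascii_error = exc

# Decoding with the same encoding restores the original text.
round_trip = utf8_bytes.decode("utf-8")
print(round_trip == text)  # prints True
```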

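So, looking back at your example: if inputtext were read with the plain open(), it would be an ordinary string holding data encoded as UTF-8, which is not Unicode. You could turn it into a Unicode string with something like inputuni = unicode(inputtext, 'utf-8') and work on it as needed, although for what you are doing that may not even be necessary. Because you read the file through codecs.open() with "utf-8", that decoding step has already been done for you. Either way, any Unicode string you write out to a file has to go through the equivalent of inputuni.encode('UTF-8'), whether you call it yourself or let a codecs.open() file object do it.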
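Which kind of string you end up with depends on how the file was opened. The following is a quick check under stated assumptions: it writes a hypothetical scratch file via tempfile and reads it back twice, once through codecs.open() with an encoding and once in binary mode.

```python
import codecs
import os
import tempfile

# Hypothetical scratch file containing one em dash, stored as UTF-8 bytes.
handle, path = tempfile.mkstemp(suffix=".txt")
os.close(handle)
with open(path, "wb") as raw_out:
    raw_out.write(b"a \xe2\x80\x94 b")

# codecs.open() with an encoding decodes while reading.
with codecs.open(path, "r", "utf-8") as decoded_in:
    decoded = decoded_in.read()

# Binary mode returns the undecoded bytes.
with open(path, "rb") as raw_in:
    raw = raw_in.read()

os.remove(path)

# Text type: unicode on Python 2, str on Python 3.
print(type(decoded).__name__)
```

The decoded value has 5 characters while the raw value has 7 bytes, because the dash takes three bytes in UTF-8.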

Solution 1
1 2016-07-15 19:22:02
