Decoding UTF-8 literals in a CSV file

Question

Question:

Does anyone know how I can turn b"it\\xe2\\x80\\x99s time to eat" into it’s time to eat?

More details and my code:

Hello everyone,

I'm currently working with a CSV file whose rows are full of UTF-8 literals, for example:

b"it\\xe2\\x80\\x99s time to eat"

The end goal is to get something like this:

it’s time to eat

To do this, I tried the following code:

import pandas as pd

file_open = pd.read_csv("/Users/Downloads/tweets.csv")
file_open["text"]=file_open["text"].str.replace("b\'", "")
file_open["text"]=file_open["text"].str.encode('ascii').astype(str)
file_open["text"]=file_open["text"].str.replace("b\"", "")[:-1]
print(file_open["text"])

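After running the code, the row I used as an example is printed as:

it\\xe2\\x80\\x99s time to eat

I tried to solve this by opening the CSV file with the following code: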

file_open = pd.read_csv("/Users/Downloads/tweets.csv", encoding = "utf-8")

which prints the example row like this:

it\\xe2\\x80\\x99s time to eat

I also tried decoding the rows with:

file_open["text"]=file_open["text"].str.decode('utf-8')

which gave me the following error:

AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas

Thank you very much in advance for your help.

Answer 1

b"it\\xe2\\x80\\x99s time to eat" suggests that your file contains escaped encoded text.

In general, you can convert this to a proper Python 3 string with the following:

x = b"it\\xe2\\x80\\x99s time to eat"
x = x.decode('unicode-escape').encode('latin1').decode('utf8')
print(x)     # it’s time to eat

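(The .encode('latin1') step works because Latin-1 maps code points 0 to 255 one-to-one onto the bytes of the same value, so it recovers the original byte sequence.)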
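
To see why each step of the chain is needed, the intermediate values can be printed one at a time. This is only an illustrative sketch; the variable names `raw`, `mojibake`, `real_bytes`, and `fixed` are made up for this example.

```python
# Escaped bytes as in the snippet above: the backslashes are literal characters.
raw = b"it\\xe2\\x80\\x99s time to eat"

# Step 1: interpret the \xNN escapes. Each escape becomes the code point with
# that value, so the text is now mis-decoded (mojibake), not yet correct.
mojibake = raw.decode("unicode-escape")
print(repr(mojibake))    # 'itâ\x80\x99s time to eat'

# Step 2: Latin-1 maps code points 0-255 to the bytes of the same value,
# which recovers the original UTF-8 byte sequence.
real_bytes = mojibake.encode("latin1")
print(real_bytes)        # b'it\xe2\x80\x99s time to eat'

# Step 3: decode those bytes as UTF-8.
fixed = real_bytes.decode("utf8")
print(fixed)             # it’s time to eat
```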

So, if your strings are still escaped after using pd.read_csv(..., encoding="utf8"), you can do something like this:

pd.read_csv(..., encoding="unicode-escape")
# ...
# Now, your values will be strings but improperly decoded:
#    itâs time to eat
#
# So we encode to bytes then decode properly:
val = val.encode('latin1').decode('utf8')
print(val)   # it’s time to eat
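
If the cells still carry the b"..." wrapper shown in the question, one alternative is to parse each cell as a Python bytes literal with `ast.literal_eval` and then decode the result. Note that this is a different technique from the one above, not something the answer proposes. The following is a minimal sketch, assuming each affected cell is a well-formed bytes literal containing single-backslash escapes; the helper name `decode_bytes_literal` is hypothetical.

```python
import ast

def decode_bytes_literal(cell):
    """Convert the text of a Python bytes literal into a decoded str.

    Cells that are not bytes literals are returned unchanged.
    """
    if not isinstance(cell, str) or not cell.startswith(("b'", 'b"')):
        return cell
    try:
        value = ast.literal_eval(cell)   # evaluates literals only, no arbitrary code
    except (ValueError, SyntaxError):
        return cell                      # malformed literal: leave it as-is
    if not isinstance(value, bytes):
        return cell
    return value.decode("utf8", errors="replace")

# Assumed cell text as read from the CSV: b"it\xe2\x80\x99s time to eat"
cell = 'b"it\\xe2\\x80\\x99s time to eat"'
print(decode_bytes_literal(cell))    # it’s time to eat
```

With pandas, such a helper could be applied to the whole column, for example `file_open["text"] = file_open["text"].map(decode_bytes_literal)`.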

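However, I think it is better to do this on the whole file rather than on each value individually, for example using StringIO (if the file is not too large):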

import pandas as pd
from io import StringIO

# Read the csv file into a StringIO object
sio = StringIO()
with open('yourfile.csv', 'r', encoding='unicode-escape') as f:
    for line in f:
        line = line.encode('latin1').decode('utf8')
        sio.write(line)
sio.seek(0)    # Reset file pointer to the beginning

# Call read_csv, passing the StringIO object
df = pd.read_csv(sio, encoding="utf8")

Answer 1: 2 votes, accepted, 2018-06-24 17:22:42
